Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Allow TruncatedSVD using randomized to automatically reset k < n_features. #17949

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

zachmayer
Copy link
Contributor

Reference Issues/PRs

Fixes #17916

What does this implement/fix? Explain your changes.

Allows TruncatedSVD using the randomized algorithm to automatically reset k < n_features.

Any other comments?

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @zachmayer

If we want to raise warnings instead of errors I think we should do it for all transformers that have a n_components attribute. Also I'm not sure we need a new warning class, a UserWarning should be enough.

@NicolasHug
Copy link
Member

Also please update the PR title to something more descriptive ;)
You can prefix as [WIP] or [MRG] when ready for reviews

@zachmayer
Copy link
Contributor Author

What's [MRG] mean?

What other transformers have a n_components attribute?

@NicolasHug
Copy link
Member

WIP = work in progress
MRG stands for merge. You'll find more details in our contributing guidelines, please check them out.

For transformers, you can check git grep -l "self.n_components"

@zachmayer
Copy link
Contributor Author

Right right, I keep forgetting about git grep

Mayer-2:scikit-learn zach2$ git grep -l "self.n_components"
benchmarks/bench_plot_nmf.py
doc/developers/develop.rst
sklearn/cluster/_bicluster.py
sklearn/cluster/_spectral.py
sklearn/cross_decomposition/_pls.py
sklearn/decomposition/_base.py
sklearn/decomposition/_dict_learning.py
sklearn/decomposition/_factor_analysis.py
sklearn/decomposition/_fastica.py
sklearn/decomposition/_incremental_pca.py
sklearn/decomposition/_kernel_pca.py
sklearn/decomposition/_lda.py
sklearn/decomposition/_nmf.py
sklearn/decomposition/_pca.py
sklearn/decomposition/_sparse_pca.py
sklearn/decomposition/_truncated_svd.py
sklearn/discriminant_analysis.py
sklearn/kernel_approximation.py
sklearn/manifold/_isomap.py
sklearn/manifold/_locally_linear.py
sklearn/manifold/_mds.py
sklearn/manifold/_spectral_embedding.py
sklearn/manifold/_t_sne.py
sklearn/mixture/_base.py
sklearn/mixture/_bayesian_mixture.py
sklearn/mixture/_gaussian_mixture.py
sklearn/mixture/tests/test_gaussian_mixture.py
sklearn/neighbors/_nca.py
sklearn/neural_network/_rbm.py
sklearn/random_projection.py
sklearn/utils/tests/test_pprint.py

Should I focus on the ones in decomposition?

@zachmayer zachmayer changed the title Zam/17916 Allow TruncatedSVD using randomized to automatically reset k < n_features. Jul 17, 2020
@zachmayer
Copy link
Contributor Author

I got a buncha tests to fix too

@zachmayer
Copy link
Contributor Author

This is a better list of files:

(sklearn-dev) Mayer-2:scikit-learn zach2$ git grep -i -e "ValueError" --and -e "n_components"
sklearn/cluster/_bicluster.py:            raise ValueError("Parameter n_components must be greater than 0,"
sklearn/cross_decomposition/_pls.py:            raise ValueError("Invalid number of components n_components=%d"
sklearn/decomposition/_incremental_pca.py:            raise ValueError("n_components=%r invalid for n_features=%d, need "
sklearn/decomposition/_incremental_pca.py:            raise ValueError("n_components=%r must be less or equal to "
sklearn/decomposition/_lda.py:            raise ValueError("Invalid 'n_components' parameter: %r"
sklearn/decomposition/_pca.py:                raise ValueError("n_components='mle' is only supported "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r must be between 0 and "
sklearn/decomposition/_pca.py:                raise ValueError("n_components=%r must be of type int "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r cannot be a string "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r must be between 1 and "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r must be of type int "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r must be strictly less than "
sklearn/decomposition/tests/test_incremental_pca.py:        with pytest.raises(ValueError, match="n_components={} invalid"
sklearn/decomposition/tests/test_incremental_pca.py:    with pytest.raises(ValueError, match="n_components={} must be"
sklearn/decomposition/tests/test_pca.py:# arpack raises ValueError for n_components == min(n_samples,  n_features)
sklearn/decomposition/tests/test_pca.py:    with pytest.raises(ValueError, match="n_components='mle' is only "
sklearn/manifold/_t_sne.py:            raise ValueError("'n_components' should be inferior to 4 for the "
sklearn/manifold/tests/test_t_sne.py:    with pytest.raises(ValueError, match="'n_components' should be .*"):
sklearn/mixture/_base.py:        raise ValueError('Expected n_samples >= n_components '
sklearn/mixture/_base.py:            raise ValueError("Invalid value for 'n_components': %d "
sklearn/random_projection.py:        raise ValueError("n_components must be strictly positive, got %d" %
sklearn/random_projection.py:                raise ValueError("n_components must be greater than 0, got %s"

@zachmayer
Copy link
Contributor Author

zachmayer commented Jul 17, 2020

@NicolasHug

So I am going to change the following files to reset n_components, if n_components > x.shape[1]:

  • partial least squares: sklearn/cross_decomposition/_pls.py
  • incremental pca: sklearn/decomposition/_incremental_pca.py
  • pca: sklearn/decomposition/_pca.py
  • base class for mixtures: sklearn/mixture/_base.py

I'll also update tests for all those classes. Do you think those changes will be ok with all the other sklearn maintainers?

One of those files (_incremental_pca.py) also check that self.n_components <= n_samples. Should I also update that to reset self.n_components = n_samples in the case where self.n_components > n_samples?

Or is _incremental_pca.py perhaps code I should omit from this change?

Base automatically changed from master to main January 22, 2021 10:52
@zachmayer
Copy link
Contributor Author

@NicolasHug Do you think the changes I proposed above will be ok with all the other sklearn maintainers?

@zachmayer zachmayer changed the title Allow TruncatedSVD using randomized to automatically reset k < n_features. [WIP] Allow TruncatedSVD using randomized to automatically reset k < n_features. Jan 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TruncatedSVD option to reduce k instead of raising "n_components must be < n_features;"
2 participants