[WIP] Allow TruncatedSVD using randomized to automatically reset k < n_features. #17949

zachmayer · 2020-07-17T17:08:32Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Allows TruncatedSVD, when using the randomized algorithm, to automatically reset k (n_components) so that k < n_features instead of raising an error.

Any other comments?

NicolasHug

Thanks for the PR @zachmayer

If we want to raise warnings instead of errors, I think we should do it for all transformers that have an n_components attribute. Also, I'm not sure we need a new warning class; a UserWarning should be enough.
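
A minimal sketch of the warn-and-reset behaviour under discussion, not the code from this PR. It assumes an oversized k is reset to n_features - 1 (the thread only says k must end up below n_features), and the helper name `reset_n_components` is hypothetical rather than part of scikit-learn.

```python
import warnings


def reset_n_components(n_components, n_features):
    """Hypothetical helper: return a k that satisfies k < n_features.

    Assumption: an oversized k is reset to n_features - 1 and a
    UserWarning is emitted instead of raising ValueError.
    """
    if n_components >= n_features:
        # Note: n_features == 1 would give k == 0; a real implementation
        # would have to decide how to handle that case.
        new_k = n_features - 1
        warnings.warn(
            f"n_components={n_components} must be < n_features={n_features}; "
            f"resetting n_components to {new_k}.",
            UserWarning,
        )
        return new_k
    return n_components
```

Using the built-in UserWarning keeps the change small, and callers who prefer the old strict behaviour can still turn the warning into an error with `warnings.simplefilter("error")`.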

NicolasHug · 2020-07-17T17:26:35Z

Also, please update the PR title to something more descriptive ;)
You can prefix it with [WIP], or with [MRG] when it's ready for review.

zachmayer · 2020-07-17T17:31:14Z

What's [MRG] mean?

What other transformers have an n_components attribute?

NicolasHug · 2020-07-17T17:35:58Z

WIP = work in progress
MRG stands for merge. You'll find more details in our contributing guidelines; please check them out.

For transformers, you can check git grep -l "self.n_components"

zachmayer · 2020-07-17T17:41:27Z

Right right, I keep forgetting about git grep

Mayer-2:scikit-learn zach2$ git grep -l "self.n_components"
benchmarks/bench_plot_nmf.py
doc/developers/develop.rst
sklearn/cluster/_bicluster.py
sklearn/cluster/_spectral.py
sklearn/cross_decomposition/_pls.py
sklearn/decomposition/_base.py
sklearn/decomposition/_dict_learning.py
sklearn/decomposition/_factor_analysis.py
sklearn/decomposition/_fastica.py
sklearn/decomposition/_incremental_pca.py
sklearn/decomposition/_kernel_pca.py
sklearn/decomposition/_lda.py
sklearn/decomposition/_nmf.py
sklearn/decomposition/_pca.py
sklearn/decomposition/_sparse_pca.py
sklearn/decomposition/_truncated_svd.py
sklearn/discriminant_analysis.py
sklearn/kernel_approximation.py
sklearn/manifold/_isomap.py
sklearn/manifold/_locally_linear.py
sklearn/manifold/_mds.py
sklearn/manifold/_spectral_embedding.py
sklearn/manifold/_t_sne.py
sklearn/mixture/_base.py
sklearn/mixture/_bayesian_mixture.py
sklearn/mixture/_gaussian_mixture.py
sklearn/mixture/tests/test_gaussian_mixture.py
sklearn/neighbors/_nca.py
sklearn/neural_network/_rbm.py
sklearn/random_projection.py
sklearn/utils/tests/test_pprint.py

Should I focus on the ones in decomposition?

zachmayer · 2020-07-17T18:00:09Z

I got a buncha tests to fix too

zachmayer · 2020-07-17T18:45:16Z

This is a better list of files:

(sklearn-dev) Mayer-2:scikit-learn zach2$ git grep -i -e "ValueError" --and -e "n_components"
sklearn/cluster/_bicluster.py:            raise ValueError("Parameter n_components must be greater than 0,"
sklearn/cross_decomposition/_pls.py:            raise ValueError("Invalid number of components n_components=%d"
sklearn/decomposition/_incremental_pca.py:            raise ValueError("n_components=%r invalid for n_features=%d, need "
sklearn/decomposition/_incremental_pca.py:            raise ValueError("n_components=%r must be less or equal to "
sklearn/decomposition/_lda.py:            raise ValueError("Invalid 'n_components' parameter: %r"
sklearn/decomposition/_pca.py:                raise ValueError("n_components='mle' is only supported "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r must be between 0 and "
sklearn/decomposition/_pca.py:                raise ValueError("n_components=%r must be of type int "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r cannot be a string "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r must be between 1 and "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r must be of type int "
sklearn/decomposition/_pca.py:            raise ValueError("n_components=%r must be strictly less than "
sklearn/decomposition/tests/test_incremental_pca.py:        with pytest.raises(ValueError, match="n_components={} invalid"
sklearn/decomposition/tests/test_incremental_pca.py:    with pytest.raises(ValueError, match="n_components={} must be"
sklearn/decomposition/tests/test_pca.py:# arpack raises ValueError for n_components == min(n_samples,  n_features)
sklearn/decomposition/tests/test_pca.py:    with pytest.raises(ValueError, match="n_components='mle' is only "
sklearn/manifold/_t_sne.py:            raise ValueError("'n_components' should be inferior to 4 for the "
sklearn/manifold/tests/test_t_sne.py:    with pytest.raises(ValueError, match="'n_components' should be .*"):
sklearn/mixture/_base.py:        raise ValueError('Expected n_samples >= n_components '
sklearn/mixture/_base.py:            raise ValueError("Invalid value for 'n_components': %d "
sklearn/random_projection.py:        raise ValueError("n_components must be strictly positive, got %d" %
sklearn/random_projection.py:                raise ValueError("n_components must be greater than 0, got %s"
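
For readers less familiar with `git grep`, the `-e ... --and -e ...` form keeps only lines that match both patterns, and `-i` makes the match case-insensitive. The sketch below reproduces that line-level filter in Python; the function name and the sample lines are illustrative assumptions, not actual scikit-learn source.

```python
import re


def lines_matching_all(lines, patterns):
    """Return the lines that match every pattern, ignoring case."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [line for line in lines if all(c.search(line) for c in compiled)]


# Illustrative input lines (made up for this example).
sample = [
    'raise ValueError("n_components must be strictly positive")',
    'raise ValueError("alpha must be positive")',
    "self.n_components = n_components",
]

# Only the first line mentions both ValueError and n_components.
matches = lines_matching_all(sample, ["ValueError", "n_components"])
```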

zachmayer · 2020-07-17T18:54:30Z

@NicolasHug

So I am going to change the following files to reset n_components if n_components > X.shape[1]:

partial least squares: sklearn/cross_decomposition/_pls.py
incremental pca: sklearn/decomposition/_incremental_pca.py
pca: sklearn/decomposition/_pca.py
base class for mixtures: sklearn/mixture/_base.py

I'll also update tests for all those classes. Do you think those changes will be ok with all the other sklearn maintainers?

One of those files (_incremental_pca.py) also checks that self.n_components <= n_samples. Should I also update that to reset self.n_components = n_samples in the case where self.n_components > n_samples?

Or is _incremental_pca.py perhaps code I should omit from this change?
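
One possible answer to the n_samples question, sketched under assumptions: cap n_components at min(n_samples, n_features) and warn. The thread leaves this open, so the helper name `clamp_to_data_shape` and the choice of upper bound are hypothetical, not the maintainers' decision.

```python
import warnings


def clamp_to_data_shape(n_components, n_samples, n_features):
    """Hypothetical helper: cap n_components at min(n_samples, n_features)."""
    upper = min(n_samples, n_features)
    if n_components > upper:
        warnings.warn(
            f"n_components={n_components} is larger than "
            f"min(n_samples, n_features)={upper}; resetting to {upper}.",
            UserWarning,
        )
        return upper
    return n_components
```

Handling both bounds in one place would keep the two checks in _incremental_pca.py consistent, at the cost of a less specific warning message.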

zachmayer · 2021-09-07T20:15:16Z

@NicolasHug Do you think the changes I proposed above will be ok with all the other sklearn maintainers?

zachmayer added 2 commits July 16, 2020 11:12

warning

9f760c1

reset k

992025b

zachmayer mentioned this pull request Jul 17, 2020

TruncatedSVD option to reduce k instead of raising "n_components must be < n_features;" #17916

Open

github-actions bot added the module:decomposition label Jul 17, 2020

NicolasHug reviewed Jul 17, 2020


zachmayer changed the title ~~Zam/17916~~ Allow TruncatedSVD using randomized to automatically reset k < n_features. Jul 17, 2020

zachmayer added 2 commits July 17, 2020 14:36

UserWarning

a0ddf83

lint

6955437

lint

b23533f

Base automatically changed from master to main January 22, 2021 10:52

zachmayer changed the title ~~Allow TruncatedSVD using randomized to automatically reset k < n_features.~~ [WIP] Allow TruncatedSVD using randomized to automatically reset k < n_features. Jan 10, 2022
