Skip to content

Remove unnecessary restriction on number of samples in IncrementalPCA #30224

Merged
adrinjalali merged 8 commits into
scikit-learn:mainfrom
ThomasGesseyJonesPX:correct_errors_in_incremental_pca
Nov 7, 2024
Merged

Remove unnecessary restriction on number of samples in IncrementalPCA #30224
adrinjalali merged 8 commits into
scikit-learn:mainfrom
ThomasGesseyJonesPX:correct_errors_in_incremental_pca

Conversation

@ThomasGesseyJonesPX

Copy link
Copy Markdown
Contributor

Currently when calling IncrementalPCA.partial_fit() the number of samples in X always has to be greater or equal to the number of PCA components due to this hardcoded check

elif not self.n_components <= n_samples:

However, this restriction actually only needs to apply to the first call to partial_fit(), which performs the initial SVD decomposition. Once this first call has established the initial $U$ and $\Sigma$, none of the algorithms steps (see original paper https://www.cs.toronto.edu/~dross/ivt/RossLimLinYang_ijcv.pdf) require any size restrictions on the number of samples in the subsequent batches passed to partial_fit. The current error message hence unnecessary restricts the usage of IncrementalPCA, in particular in online applications where data might arrive in batches of different sizes.

This PR, updates the aforementioned check to only apply to the first call of partial_fit and adds a non-regression test for this issue.

@github-actions

github-actions Bot commented Nov 5, 2024

Copy link
Copy Markdown

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c752837. Link to the linter CI: here

@ogrisel ogrisel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Some minor feedback below, but otherwise LGTM.

We probably need better tests (e.g. on non-spherical data) for this estimator, and we could also clarify what are the expectations in terms of accuracy between the full-batch and incremental version of PCA, but this is beyond the scope of this PR.

Comment thread sklearn/decomposition/tests/test_incremental_pca.py Outdated
Comment thread sklearn/decomposition/tests/test_incremental_pca.py Outdated
Comment thread sklearn/decomposition/tests/test_incremental_pca.py Outdated
Comment thread sklearn/decomposition/_incremental_pca.py Outdated
@ThomasGesseyJonesPX

Copy link
Copy Markdown
Contributor Author

Thanks for the feedback! I have implemented all of your suggestions

@ThomasGesseyJonesPX ThomasGesseyJonesPX force-pushed the correct_errors_in_incremental_pca branch from d2d4f27 to c752837 Compare November 7, 2024 09:47
@adrinjalali adrinjalali merged commit 94f8875 into scikit-learn:main Nov 7, 2024
@ThomasGesseyJonesPX ThomasGesseyJonesPX deleted the correct_errors_in_incremental_pca branch November 7, 2024 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants