
IncrementalPCA.partial_fit gives an uninformative error message and is possibly broken #6452

Closed
erensezener opened this Issue Feb 26, 2016 · 12 comments


erensezener commented Feb 26, 2016

IncrementalPCA.partial_fit raises the following error when the number of samples is smaller than n_components:

  File "___/sklearn/decomposition/incremental_pca.py", line 218, in partial_fit
    (self.components_.shape[0], self.n_components_))
ValueError: Number of input features has changed from 10 to 20 between calls to partial_fit! 
Try setting n_components to a fixed value.

This error can be reproduced with the following snippet whenever n_samples < n_components:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features = 10, 50
ipca = IncrementalPCA(n_components=20)

# each batch has only 10 samples, fewer than the 20 requested components
for i in range(5):
    ipca.partial_fit(np.random.rand(n_samples, n_features))

The error message is not informative, because the number of input features is constant (i.e., 50) and n_components is fixed anyway.

Also, I don't see any fundamental reason why this shouldn't work: in regular PCA, the number of components after transformation can be larger than the number of samples.
Is this a limitation of the incremental PCA algorithm or of the implementation?

jakevdp (Member) commented Feb 26, 2016

Is this a limitation of the incremental PCA algorithm or the implementation?

Both, sort of. In standard PCA any component beyond the number of samples is ill-determined (they all have eigenvalue zero, and so are completely degenerate) and thus are useless in almost all contexts. That's probably why that corner case wasn't accounted for in IncrementalPCA.

That said, you're absolutely right that this error should be more informative. I'll put together a quick PR.

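As a rough illustration of that degeneracy (a minimal sketch, not from the original discussion, assuming numpy and mean-centered data as PCA uses internally):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10, 50)        # 10 samples, 50 features
X = X - X.mean(axis=0)      # center the data, as PCA does internally

# A 10 x 50 matrix has at most 10 singular values at all, and centering
# removes one more degree of freedom, so at most 9 are nonzero. Any
# "component" beyond that has singular value (hence eigenvalue) zero.
s = np.linalg.svd(X, compute_uv=False)
print(s.shape)   # (10,)
print(s[-1])     # ~1e-15: already degenerate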

jakevdp (Member) commented Feb 26, 2016

Looking at the code, I think we could make it support this case by using full_matrices=True rather than full_matrices=False in the SVD. But this would potentially be much more computationally costly without much (any?) gain in usefulness.

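A quick sketch of the shape difference (an illustration of numpy's SVD flag, not the actual IncrementalPCA code):

import numpy as np

A = np.random.rand(10, 50)   # a batch with n_samples < n_features

# Thin SVD: Vt has only min(10, 50) = 10 rows, so a 10-sample batch can
# never yield the 20 components asked for.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, Vt.shape)     # (10, 10) (10, 50)

# Full SVD: Vt is padded to (50, 50) with arbitrary orthonormal rows that
# carry zero variance -- more components, but at a much higher cost.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, Vt.shape)     # (10, 10) (50, 50)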

jakevdp (Member) commented Feb 26, 2016

Actually, I now see it's more complicated than that. The svd_flip function makes some assumptions about U and V that don't hold with full_matrices=True. Though I believe it's possible to make this corner case work correctly, I think it might be cleaner to simply raise a more useful error. Any thoughts?

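For context, a sketch of the shape assumption in question (an illustration under stated assumptions; svd_flip lives in sklearn.utils.extmath):

import numpy as np
from sklearn.utils.extmath import svd_flip

A = np.random.rand(10, 50)

# svd_flip pairs each column of U with the corresponding row of Vt to fix
# the component signs. That pairing holds for the thin SVD:
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, Vt = svd_flip(U, Vt)      # fine: U is (10, 10), Vt is (10, 50)

# With full_matrices=True, Vt gains 40 extra rows with no matching column
# in U, so svd_flip(U, Vt) would fail: U is (10, 10), Vt is (50, 50).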

nelson-liu (Contributor) commented Feb 26, 2016

Though I believe it's possible to make this corner case work correctly, I think it might be cleaner to simply raise a more useful error.

+1


erensezener commented Feb 26, 2016

Both, sort of. In standard PCA any component beyond the number of samples is ill-determined (they all have eigenvalue zero, and so are completely degenerate) and thus are useless in almost all contexts. That's probably why that corner case wasn't accounted for in IncrementalPCA.

Isn't the goal of incremental PCA to be able to process samples in batches?
So n_samples > n_components > batch_size should work, right?
But with the current implementation it must be batch_size > n_components.


erensezener commented Feb 28, 2016

@jakevdp the variable names in the example I gave were not very accurate. The following code should work in principle, but it fails whenever batch_size < n_components:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features = 1000, 50
data = np.random.rand(n_samples, n_features)
batch_size = 10
number_of_batches = n_samples // batch_size  # 1000 // 10 = 100
ipca = IncrementalPCA(n_components=20)

for i in range(number_of_batches):
    ipca.partial_fit(data[i * batch_size:(i + 1) * batch_size, :])


@amueller amueller added this to the 0.19 milestone Oct 8, 2016

amueller (Member) commented Oct 8, 2016

I agree, it should be possible to do this with n_samples > n_components > batch_size. It's a bit tricky from an API perspective in the first step, though. Also, you don't know n_samples in the first step...
If you can store the components, I think you should be able to increase batch_size, too. So maybe it's not worth covering this edge case, and we should just raise an error and suggest increasing batch_size.

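One way to sidestep the limitation along those lines (a sketch, not part of any fix): buffer incoming mini-batches until at least n_components samples have accumulated, then hand the concatenated block to partial_fit.

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_components, batch_size, n_features = 20, 10, 50
ipca = IncrementalPCA(n_components=n_components)

buffer = []
for _ in range(100):                         # stream of small batches
    buffer.append(np.random.rand(batch_size, n_features))
    if sum(len(b) for b in buffer) >= n_components:
        # each effective batch now has at least 20 samples
        ipca.partial_fit(np.vstack(buffer))
        buffer = []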

wallygauze (Contributor) commented Jul 6, 2017

The dominant opinion I see here is that raising an error would be enough, but because of the 'Enhancement' and 'Moderate' labels that were added later, I have to ask whether enabling n_samples > n_components > batch_size is still what is wanted for the 0.19 release.
Is just an error OK?


wallygauze (Contributor) commented Jul 14, 2017

Can someone consider my pull request #9303? @amueller I would change it from WIP to MRG, but I need confirmation that an error is enough for the 0.19 release.


wallygauze (Contributor) commented Jul 14, 2017

Never mind, I'll just change it to MRG for now so it is clear I have finished whatever work was needed for that.


@amueller amueller marked this as a duplicate of #9330 Jul 15, 2017

lesteve (Member) commented Sep 13, 2017

Fixed by #9303.

@wallygauze please shout if I got this wrong.

A small piece of advice, @wallygauze: use "Fix #issueNumber" in your PR description; that way the associated issue gets closed automatically when the PR is merged. For more details, look at this.


@lesteve lesteve closed this Sep 13, 2017

wallygauze (Contributor) commented Sep 13, 2017

Nice, didn't know about that, thanks.

