DOC Update IncrementalPCA example to actually use batches #25379

Open: dsblank wants to merge 1 commit into main


@dsblank dsblank commented Jan 12, 2023

What does this implement/fix? Explain your changes.

It wasn't clear to me how to actually use the API to train an IncrementalPCA. This updates the example and gives the same results as before.

@jeremiedbb
Member

jeremiedbb commented Jan 13, 2023

Hi @dsblank, I'm not sure I understand what you mean by "to actually use batches". Internally, fit does use batches, with the batch size controlled by the batch_size parameter. It is equivalent to calling partial_fit repeatedly, once per batch, which is what fit actually does. Calling partial_fit manually is for advanced use cases such as out-of-core learning, or when you don't have all the data from the start (as described here https://scikit-learn.org/stable/computing/scaling_strategies.html#incremental-learning), so I would keep this example as is.
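
The equivalence described above can be checked with a minimal sketch. It assumes the iris data and batch_size=10 from the example under discussion; because 150 samples divide evenly by 10, the hand-made batches should coincide with the ones fit generates internally. The names ipca_fit and ipca_manual are illustrative and not part of the PR.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

X = load_iris().data  # 150 samples, 4 features
batch_size = 10

# Variant 1: let fit() split X into batches internally.
ipca_fit = IncrementalPCA(n_components=2, batch_size=batch_size).fit(X)

# Variant 2: feed the same consecutive batches by hand.
ipca_manual = IncrementalPCA(n_components=2)
for start in range(0, X.shape[0], batch_size):
    ipca_manual.partial_fit(X[start:start + batch_size])

# Compare the learned principal axes of both variants.
same_components = np.allclose(ipca_fit.components_, ipca_manual.components_)
print("components match:", same_components)
```

When the data does not divide evenly into batches, fit may group the trailing rows differently from a naive loop, so the two variants are only expected to agree when the batches are identical.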

@dsblank
Author

dsblank commented Jan 13, 2023

@jeremiedbb:

> I'm not sure I understand what you mean by "to actually use batches". Internally fit does use batches.

Yes, but if you can use that interface, why not use the non-incremental version?

It seems the main use for IncrementalPCA is when you don't want to use the auto-batching API and instead provide the batches yourself (say, by loading them from files).

> Calling partial_fit manually is for advanced use cases like out of core, or when you don't have all the data from the start

I think it is worthwhile to have such an example. I can update the code to show all three variations if that would help.

@@ -33,15 +33,27 @@
 y = iris.target

 n_components = 2
-ipca = IncrementalPCA(n_components=n_components, batch_size=10)
-X_ipca = ipca.fit_transform(X)
+X_ipca = IncrementalPCA(n_components=n_components)
Member

I don't understand the choice of X_ipca as a variable name for the estimator instance. I think it should stay named ipca.

X_* variable names should be reserved for 2D arrays (original data or transformed data) in this example.
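
The naming convention requested here can be illustrated with a short sketch. It assumes the iris data used by the example; the two assignments mirror the lines the diff removes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

X = load_iris().data  # original data: 2D array, so it gets an X name

# Estimator instance: no X_ prefix.
ipca = IncrementalPCA(n_components=2, batch_size=10)

# Transformed data: 2D array, so it gets an X_ prefix.
X_ipca = ipca.fit_transform(X)
```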

@jeremiedbb
Member

> Yes, but if you can use that interface, why not use the non-incremental version?

When you have a large dataset (that still fits in memory), the minibatch version can converge a lot faster. With the full-batch version you need many iterations over the full dataset to converge, whereas with the minibatch version you can often reach almost the same quality of results within only a few epochs, because the learned parameters are updated more often and in smaller steps.

> I think it is worthwhile to have such an example. I can update the code to show all three variations if that would help.

There's already an example that explains how to use the partial_fit API when not all of the data is available at once: https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html

    batch.append(row)
    if len(batch) == 10:
        X_ipca.partial_fit(batch)
        batch = []
Member

@ogrisel ogrisel Jan 13, 2023

It would be both simpler and more efficient to slice batches:

for idx in range(0, len(X), batch_size):
    X_batch = X[idx:idx + batch_size]
    ipca.partial_fit(X_batch)

...
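
One detail the slicing loop leaves open is what happens when the number of rows is not a multiple of batch_size. A possible sketch, not taken from the PR, uses sklearn.utils.gen_batches with its min_batch_size argument so that a short trailing batch is merged into the previous one. The values batch_size=37 and n_components=3 are chosen here only to force a 2-row remainder on the 150-row iris data.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA
from sklearn.utils import gen_batches

X = load_iris().data  # 150 samples, 4 features
n_components = 3
batch_size = 37  # deliberately does not divide 150 evenly

ipca = IncrementalPCA(n_components=n_components)
batch_sizes = []

# gen_batches yields slice objects; min_batch_size folds a too-short
# trailing batch into the previous one instead of yielding it alone.
for batch in gen_batches(len(X), batch_size, min_batch_size=n_components):
    X_batch = X[batch]
    batch_sizes.append(len(X_batch))
    ipca.partial_fit(X_batch)

print(batch_sizes)
```

This matters because a batch with fewer rows than n_components can be rejected by partial_fit, at least in some scikit-learn versions, so a plain range-based loop may fail on the last slice.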

Author

Yes, that makes sense!

@ogrisel
Member

ogrisel commented Jan 13, 2023

> There's already an example that explains how to use the partial_fit API when not all of the data is available at once https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html

I personally wouldn't mind showing that the IncrementalPCA model also supports the partial_fit API in the plot_incremental_pca.py example, maybe with a comment that explains that each batch could be loaded progressively from a disk-based storage instead of slicing an in-memory X matrix.
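
As a rough illustration of that idea, the sketch below writes the iris data to a temporary file and then reads it back batch by batch through numpy.memmap. The file name, the use of a memmap, and the float64 layout are assumptions made for this illustration, not something proposed in the PR; a real out-of-core dataset would already live on disk.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

X = load_iris().data.astype(np.float64)
batch_size = 10

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; stands in for a dataset too large for memory.
    path = os.path.join(tmp_dir, "iris_float64.dat")

    # Write the data to disk once.
    X_out = np.memmap(path, dtype=np.float64, mode="w+", shape=X.shape)
    X_out[:] = X
    X_out.flush()
    del X_out

    # Re-open read-only and fit one batch at a time.
    X_disk = np.memmap(path, dtype=np.float64, mode="r", shape=X.shape)
    ipca = IncrementalPCA(n_components=2)
    for idx in range(0, X_disk.shape[0], batch_size):
        # np.array copies just this slice into an ordinary in-memory array.
        X_batch = np.array(X_disk[idx:idx + batch_size])
        ipca.partial_fit(X_batch)
    del X_disk  # release the file before the directory is removed

# The in-memory X is used for the transform only to keep the sketch short.
X_ipca = ipca.transform(X)
```

Only one batch is held as a regular array at any time during fitting, which is the property that makes partial_fit usable when the full matrix does not fit in memory.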

@glemaitre glemaitre changed the title from "Update IncrementalPCA example to actually use batches" to "DOC Update IncrementalPCA example to actually use batches" on Jan 13, 2023
@jeremiedbb
Member

@dsblank are you still interested in finalizing this PR?
