DOC Update IncrementalPCA example to actually use batches #25379

dsblank · 2023-01-12T18:56:29Z

What does this implement/fix? Explain your changes.

It wasn't clear to me how to actually use the API to train an IncrementalPCA. This updates the example and gives the same results as before.

jeremiedbb · 2023-01-13T11:29:19Z

Hi @dsblank, I'm not sure I understand what you mean by "to actually use batches". Internally fit does use batches. This is controlled by the batch_size parameter. It's equivalent to calling partial_fit on each batch in a row, which is what fit actually does. Calling partial_fit manually is for advanced use cases like out of core, or when you don't have all the data from the start (as described here https://scikit-learn.org/stable/computing/scaling_strategies.html#incremental-learning), so I would keep this example as is.
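As a minimal sketch of the equivalence described above, the snippet below fits one IncrementalPCA through fit with batch_size=10 and another through a hand-written partial_fit loop over the same consecutive slices. It assumes the iris data from the example and a batch size that divides the 150 samples evenly. The variable names are illustrative, not taken from the PR.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

X = load_iris().data  # 150 samples, 4 features
batch_size = 10       # divides 150 evenly, so both variants see the same batches

# Variant 1: let fit() split X into batches of `batch_size` rows.
ipca_fit = IncrementalPCA(n_components=2, batch_size=batch_size).fit(X)

# Variant 2: feed the same consecutive slices through partial_fit() by hand.
ipca_manual = IncrementalPCA(n_components=2)
for start in range(0, len(X), batch_size):
    ipca_manual.partial_fit(X[start:start + batch_size])

# Compare the learned principal axes of the two estimators.
same_components = np.allclose(ipca_fit.components_, ipca_manual.components_)
print("components match:", same_components)
```

If the batch size did not divide the number of samples, the last batch produced by fit might differ from a naive slice, so the comparison is only meant for this evenly divisible case.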

dsblank · 2023-01-13T14:48:03Z

@jeremiedbb:

> I'm not sure I understand what you mean by "to actually use batches". Internally fit does use batches.

Yes, but if you can use that interface, why not use the non-incremental version?

It seems the main use for IncrementalPCA is when you don't want to use the auto-batching API and instead provide the batches yourself (by, say, loading them from files).

> Calling partial_fit manually is for advanced use cases like out of core, or when you don't have all the data from the start

I think it is worthwhile to have such an example. I can update the code to show all three variations if that would help.
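To make the comparison with the non-incremental version concrete, here is a small sketch that projects the iris data with both PCA and IncrementalPCA and measures how far apart the two projections are. The error measure (mean absolute difference of the absolute projected values) is one reasonable choice, not the only one, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, IncrementalPCA

X = load_iris().data

# Full-batch PCA sees the whole matrix at once.
X_pca = PCA(n_components=2).fit_transform(X)

# IncrementalPCA processes X in chunks of 10 rows inside fit_transform().
X_ipca = IncrementalPCA(n_components=2, batch_size=10).fit_transform(X)

# The sign of each component is arbitrary, so compare absolute values.
mean_abs_err = np.abs(np.abs(X_pca) - np.abs(X_ipca)).mean()
print(f"mean absolute difference: {mean_abs_err:.4f}")
```

On a dataset this small both estimators fit easily in memory, so the snippet only illustrates that the two APIs are interchangeable here. It says nothing about speed or memory use on large data.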

ogrisel · 2023-01-13T15:02:42Z

examples/decomposition/plot_incremental_pca.py

@@ -33,15 +33,27 @@
 y = iris.target

 n_components = 2
-ipca = IncrementalPCA(n_components=n_components, batch_size=10)
-X_ipca = ipca.fit_transform(X)
+X_ipca = IncrementalPCA(n_components=n_components)


I don't understand the choice of X_ipca as a variable name for the estimator instance. I think it should stay named ipca.

X_* variable names should be reserved for 2D arrays (original data or transformed data) in this example.

jeremiedbb · 2023-01-13T15:03:20Z

> Yes, but if you can use that interface, why not use the non-incremental version?

When you have a large dataset (that still fits in memory), the minibatch version can converge a lot faster. The full-batch version needs many iterations over the whole dataset to converge, whereas the minibatch version can often reach almost the same quality of results within only a few epochs, because it updates the learned parameters more often and in smaller steps.

> I think it is worthwhile to have such an example. I can update the code to show all three variations if that would help.

There's already an example that explains how to use the partial_fit API when all the data is not available at once https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html

ogrisel · 2023-01-13T15:06:10Z

examples/decomposition/plot_incremental_pca.py

+    batch.append(row)
+    if len(batch) == 10:
+        X_ipca.partial_fit(batch)
+        batch = []


It would be both simpler and more efficient to slice batches:

    for idx in range(0, len(X), batch_size):
        X_batch = X[idx:idx + batch_size]
        ipca.partial_fit(X_batch)
        ...

dsblank · 2023-01-13

Yes, that makes sense!

ogrisel · 2023-01-13T15:11:20Z

> There's already an example that explains how to use the partial_fit API when all the data is not available at once https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html

I personally wouldn't mind showing that the IncrementalPCA model also supports the partial_fit API in the plot_incremental_pca.py example, maybe with a comment that explains that each batch could be loaded progressively from a disk-based storage instead of slicing an in-memory X matrix.
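A rough sketch of what such a disk-backed variant could look like is given below. It writes the iris features to a temporary .npy file as a stand-in for data that already lives on disk (the file name is hypothetical), memory-maps the file, and passes one slice at a time to partial_fit. This is an illustration under those assumptions, not the code from the PR.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

batch_size = 10

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for a dataset that already lives on disk (hypothetical file name).
    path = os.path.join(tmp_dir, "iris_features.npy")
    np.save(path, load_iris().data)

    # mmap_mode="r" maps the file instead of reading it fully into memory.
    X_disk = np.load(path, mmap_mode="r")
    n_samples = X_disk.shape[0]

    ipca = IncrementalPCA(n_components=2)
    for start in range(0, n_samples, batch_size):
        # np.array copies just this slice into an ordinary in-memory array.
        X_batch = np.array(X_disk[start:start + batch_size])
        ipca.partial_fit(X_batch)

    # Transform batch by batch as well, then stack the small 2D results.
    X_ipca = np.vstack([
        ipca.transform(np.array(X_disk[start:start + batch_size]))
        for start in range(0, n_samples, batch_size)
    ])

    del X_disk  # drop the memory map before the temporary directory is removed

print(X_ipca.shape)
```

With real out-of-core data, the np.save step would be replaced by whatever loader produces each batch, and the stacked transform output would itself need to be written out in pieces if it does not fit in memory.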

jeremiedbb · 2023-03-01T00:02:16Z

@dsblank are you still interested in finalizing this PR?

Use batches to update IncrementalPCA

3aadb90

ogrisel reviewed Jan 13, 2023


glemaitre changed the title ~~Update IncrementalPCA example to actually use batches~~ DOC Update IncrementalPCA example to actually use batches Jan 13, 2023

github-actions bot added the Documentation label Jan 13, 2023
