DOC Update IncrementalPCA example to actually use batches #25379
Hi @dsblank, I'm not sure I understand what you mean by "to actually use batches". Internally
Yes, but if you can use that interface, why not use the non-incremental version? It seems the main use for IncrementalPCA is when you don't want to use the auto-batching API, and manually provide the batch yourself (by, say, loading batches from files).
I think it is worthwhile to have such an example. I can update the code to show all three variations if that would help.
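To make the file-based use case mentioned above concrete, here is a minimal sketch rather than the code proposed in this PR. It assumes the data has been split into `.npy` chunks on disk; the 50-row chunk size, the `chunk_*.npy` file names, and the temporary directory are all hypothetical choices made only so the snippet is self-contained.

```python
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

X = load_iris().data  # 150 samples, 4 features

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical on-disk layout: one .npy file per chunk of 50 rows.
    chunk_paths = []
    for i, start in enumerate(range(0, len(X), 50)):
        path = Path(tmp_dir) / f"chunk_{i}.npy"
        np.save(path, X[start:start + 50])
        chunk_paths.append(path)

    ipca = IncrementalPCA(n_components=2)
    # Load and fit one chunk at a time; the estimator never sees all of X at once.
    for path in chunk_paths:
        ipca.partial_fit(np.load(path))

    # Transform chunk by chunk as well, then stack the projected rows.
    X_ipca = np.vstack([ipca.transform(np.load(path)) for path in chunk_paths])
```

In a real out-of-core setting the chunks would already exist on disk, so the saving loop would not be needed.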
@@ -33,15 +33,27 @@
y = iris.target

n_components = 2
ipca = IncrementalPCA(n_components=n_components, batch_size=10)
X_ipca = ipca.fit_transform(X)
X_ipca = IncrementalPCA(n_components=n_components)
I don't understand the choice of X_ipca as a variable name for the estimator instance. I think it should stay named ipca. X_* variable names should be reserved for 2D arrays (original data or transformed data) in this example.
When you have a large dataset (that still fits in memory), using the minibatch version can converge a lot faster. Whereas the full-batch version needs many iterations over the full dataset to converge, the minibatch version can often reach almost the same quality of results within only a few epochs, because it updates the learned parameters more often and in smaller steps.
There's already an example that explains how to use the partial_fit API when all the data is not available at once: https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
batch.append(row)
if len(batch) == 10:
    X_ipca.partial_fit(batch)
    batch = []
It would be both simpler and more efficient to slice batches:
for idx in range(0, len(X), batch_size):
    X_batch = X[idx:idx + batch_size]
    ipca.partial_fit(X_batch)
    ...
Yes, that makes sense!
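For reference, the slicing approach suggested above can be sketched end to end and compared with the automatic batching done by fit. This is an illustrative sketch, not the final example code: the names ipca_auto, ipca_manual, X_auto, X_manual and same are chosen here for clarity, and batch_size=10 is taken from the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA

X = load_iris().data
n_components = 2
batch_size = 10

# Automatic batching: fit() splits X into batches of batch_size internally.
ipca_auto = IncrementalPCA(n_components=n_components, batch_size=batch_size)
X_auto = ipca_auto.fit_transform(X)

# Manual batching: slice X and feed each slice to partial_fit().
ipca_manual = IncrementalPCA(n_components=n_components)
for idx in range(0, len(X), batch_size):
    X_batch = X[idx:idx + batch_size]
    ipca_manual.partial_fit(X_batch)
X_manual = ipca_manual.transform(X)

# Both estimators see the same batches in the same order,
# so the two projections are expected to agree.
same = np.allclose(X_auto, X_manual)
```

Keeping ipca_* names for estimators and X_* names for 2D arrays follows the naming convention requested earlier in this review.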
I personally wouldn't mind showing that the
@dsblank are you still interested in finalizing this PR?
What does this implement/fix? Explain your changes.
It wasn't clear to me how to actually use the API to train an IncrementalPCA. This updates the example, and gives the same results as before.