Optimizing the fine-tuning of PCA / KernelPCA n_components #19649

Open

dcoulomb opened this issue Mar 8, 2021 · 3 comments
dcoulomb commented Mar 8, 2021

This example shows a way to use cross-validation to select the best value for n_components in the PCA:
https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html

However, it seems that there would be a more computationally efficient way to do it: fit the PCA only once with the maximum value of n_components to be tested, and then apply the transformation with different values of n_components.
Indeed, when GridSearchCV is used to optimize n_components in the PCA, the PCA is refit from scratch for each candidate value.
Since PCA can be very slow, optimizing this would be cool.

Any idea on how to do this easily with the current implementation of PCA and GridSearchCV?
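For illustration, here is roughly what I mean done by hand, outside of GridSearchCV (just NumPy slicing; the n_components values are made up):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 30)

# Fit once with the largest value to be tested.
pca = PCA(n_components=10).fit(X)
X_full = pca.transform(X)

# PCA orders components by explained variance, so the transform for a smaller
# n_components is just the first columns of the full transform.
for k in (2, 5, 10):
    X_k = X_full[:, :k]
    # ... evaluate a downstream model on X_k ...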

TomDLT (Member) commented Mar 9, 2021

A workaround is to combine a full PCA with SelectKBest and a dummy score function that ranks features in decreasing order: since PCA already sorts its components by explained variance, keeping the first k columns is equivalent to fitting with n_components=k.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline

# Dummy score function: earlier features get higher scores, so SelectKBest
# simply keeps the first k columns.
def decreasing_order(X, y=None):
    n_features = X.shape[1]
    return np.arange(n_features, 0, -1)

n_components = 3
pca = PCA(n_components=n_components)
other_pca = make_pipeline(PCA(), SelectKBest(decreasing_order, k=n_components))

X = np.random.randn(100, 30)
y = np.random.randn(100)
# Both approaches produce the same transformed data.
np.testing.assert_almost_equal(pca.fit_transform(X, y),
                               other_pca.fit_transform(X, y))

Then, the pipeline can be optimized with GridSearchCV, changing k in SelectKBest, and using the caching option to save intermediate transforms in the pipeline:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
# memory='.' caches the fitted transformers on disk, so the PCA step is not
# refit for every candidate value of k.
pipeline = make_pipeline(PCA(), SelectKBest(decreasing_order, k=n_components),
                         Ridge(), memory='.')
gscv = GridSearchCV(estimator=pipeline,
                    param_grid=dict(selectkbest__k=[1, 2, 3]))
gscv.fit(X, y)

vnmabus (Contributor) commented Aug 9, 2022

I actually wanted a similar thing for k-NN when optimizing the value of k: one could fit with the maximum number of neighbours and then drop them one after the other without having to refit anything.
Maybe it would be good to have some API that estimators could implement to receive a set of parameter combinations and return a dict of cloned and fitted estimators, one per combination (or some lazy way to compute that), as sketched below.
That way, hyperparameter search functions could use it to perform the search efficiently, using information that only the estimator itself knows.
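Roughly the kind of method I have in mind, as a purely hypothetical sketch (nothing like fit_multiple exists in scikit-learn today; the default would just clone and refit, while an estimator like k-NN could override it to share work across the different values of k):

from sklearn.base import clone

def fit_multiple(estimator, X, y, param_combinations):
    # Hypothetical default: clone and refit for each parameter combination.
    # An estimator-specific override could instead fit once and derive the
    # other fitted clones cheaply.
    fitted = {}
    for params in param_combinations:
        est = clone(estimator).set_params(**params)
        fitted[tuple(sorted(params.items()))] = est.fit(X, y)
    return fitted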

Other solutions, such as a warm_start parameter of some sort, presuppose that the user knows how to specify the parameters in a sensible order, and that parallelization won't mess with that.

TomDLT (Member) commented Aug 9, 2022

For k-NN, a workaround is to precompute the graph with the largest number of neighbors considered, and give the precomputed graph to a subsequent estimator. For example:

import numpy as np
from sklearn.neighbors import KNeighborsTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

knnr = KNeighborsRegressor(n_neighbors=5)
# Precompute the graph with the largest number of neighbors considered; the
# regressor with metric="precomputed" then only uses the 5 nearest of them.
other_knnr = make_pipeline(KNeighborsTransformer(n_neighbors=10),
                           KNeighborsRegressor(n_neighbors=5, metric="precomputed"))

X = np.random.randn(100, 30)
y = np.random.randn(100)
np.testing.assert_almost_equal(knnr.fit(X, y).predict(X),
                               other_knnr.fit(X, y).predict(X))

Then, the pipeline can be optimized with GridSearchCV, changing n_neighbors in KNeighborsRegressor, and using the caching option to save intermediate transforms in the pipeline:

from sklearn.model_selection import GridSearchCV
# memory='.' caches the fitted KNeighborsTransformer on disk, so the neighbor
# graph is not recomputed for every candidate n_neighbors.
pipeline = make_pipeline(KNeighborsTransformer(n_neighbors=10),
                         KNeighborsRegressor(n_neighbors=5, metric="precomputed"),
                         memory='.')
gscv = GridSearchCV(estimator=pipeline,
                    param_grid=dict(kneighborsregressor__n_neighbors=[4, 6, 8, 10]))
gscv.fit(X, y)
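Note that the candidate values for kneighborsregressor__n_neighbors should not exceed the n_neighbors used in KNeighborsTransformer, since the regressor can only use neighbors that are present in the precomputed graph.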
