Optimizing the fine-tuning of PCA / KernelPCA n_components #19649

Open

dcoulomb opened this issue Mar 8, 2021 · 3 comments
dcoulomb commented Mar 8, 2021

This example shows a way to use cross-validation to select the best value for n_components in the PCA:
https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html

However, it seems that there would be a more computationally efficient way to do it: fit the PCA only once with the maximum value of n_components to be tested, and then apply the transformation with different values of n_components.
Indeed, when GridSearchCV is used to optimize n_components in the PCA, the PCA is refit from scratch for each candidate value.
Since PCA can be very slow, optimizing this would be cool.

Any idea on how to do this easily with the current implementation of PCA and GridSearchCV?
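For illustration, here is roughly what I mean done by hand, outside of GridSearchCV (just NumPy slicing; the n_components values are made up):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 30)

# Fit once with the largest value to be tested.
pca = PCA(n_components=10).fit(X)
X_full = pca.transform(X)

# PCA orders components by explained variance, so the transform for a smaller
# n_components is just the first columns of the full transform.
for k in (2, 5, 10):
    X_k = X_full[:, :k]
    # ... evaluate a downstream model on X_k ...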

TomDLT (Member) commented Mar 9, 2021

A workaround is to combine a full PCA with SelectKBest and a dummy score function that ranks features in decreasing order: since PCA already sorts its components by explained variance, keeping the first k columns is equivalent to fitting with n_components=k.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline

# Dummy score function: earlier features get higher scores, so SelectKBest
# simply keeps the first k columns.
def decreasing_order(X, y=None):
    n_features = X.shape[1]
    return np.arange(n_features, 0, -1)

n_components = 3
pca = PCA(n_components=n_components)
other_pca = make_pipeline(PCA(), SelectKBest(decreasing_order, k=n_components))

X = np.random.randn(100, 30)
y = np.random.randn(100)
# Both approaches produce the same transformed data.
np.testing.assert_almost_equal(pca.fit_transform(X, y),
                               other_pca.fit_transform(X, y))

Then, the pipeline can be optimized with GridSearchCV, changing k in SelectKBest, and using the caching option to save intermediate transforms in the pipeline:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
# memory='.' caches the fitted transformers on disk, so the PCA step is not
# refit for every candidate value of k.
pipeline = make_pipeline(PCA(), SelectKBest(decreasing_order, k=n_components),
                         Ridge(), memory='.')
gscv = GridSearchCV(estimator=pipeline,
                    param_grid=dict(selectkbest__k=[1, 2, 3]))
gscv.fit(X, y)

vnmabus (Contributor) commented Aug 9, 2022

I actually wanted a similar thing for k-NN when optimizing the value of k: one could fit with the maximum number of neighbours and then drop them one after the other without having to refit anything.
Maybe it would be good to have some API that estimators could implement to receive a set of parameter combinations and return a dict of cloned and fitted estimators, one per combination (or some lazy way to compute that), as sketched below.
That way, hyperparameter search functions could use it to perform the search efficiently, using information that only the estimator itself knows.
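Roughly the kind of method I have in mind, as a purely hypothetical sketch (nothing like fit_multiple exists in scikit-learn today; the default would just clone and refit, while an estimator like k-NN could override it to share work across the different values of k):

from sklearn.base import clone

def fit_multiple(estimator, X, y, param_combinations):
    # Hypothetical default: clone and refit for each parameter combination.
    # An estimator-specific override could instead fit once and derive the
    # other fitted clones cheaply.
    fitted = {}
    for params in param_combinations:
        est = clone(estimator).set_params(**params)
        fitted[tuple(sorted(params.items()))] = est.fit(X, y)
    return fitted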

Other solutions, such as a warm_start parameter of some sort, presuppose that the user knows how to specify the parameters in a sensible order, and that parallelization won't mess with that.

TomDLT (Member) commented Aug 9, 2022

For k-NN, a workaround is to precompute the graph with the largest number of neighbors considered, and give the precomputed graph to a subsequent estimator. For example:

import numpy as np
from sklearn.neighbors import KNeighborsTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

knnr = KNeighborsRegressor(n_neighbors=5)
# Precompute the graph with the largest number of neighbors considered; the
# regressor with metric="precomputed" then only uses the 5 nearest of them.
other_knnr = make_pipeline(KNeighborsTransformer(n_neighbors=10),
                           KNeighborsRegressor(n_neighbors=5, metric="precomputed"))

X = np.random.randn(100, 30)
y = np.random.randn(100)
np.testing.assert_almost_equal(knnr.fit(X, y).predict(X),
                               other_knnr.fit(X, y).predict(X))

Then, the pipeline can be optimized with GridSearchCV, changing n_neighbors in KNeighborsRegressor, and using the caching option to save intermediate transforms in the pipeline:

from sklearn.model_selection import GridSearchCV
# memory='.' caches the fitted KNeighborsTransformer on disk, so the neighbor
# graph is not recomputed for every candidate n_neighbors.
pipeline = make_pipeline(KNeighborsTransformer(n_neighbors=10),
                         KNeighborsRegressor(n_neighbors=5, metric="precomputed"),
                         memory='.')
gscv = GridSearchCV(estimator=pipeline,
                    param_grid=dict(kneighborsregressor__n_neighbors=[4, 6, 8, 10]))
gscv.fit(X, y)
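Note that the candidate values for kneighborsregressor__n_neighbors should not exceed the n_neighbors used in KNeighborsTransformer, since the regressor can only use neighbors that are present in the precomputed graph.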
