Optimizing the fine-tuning of PCA / KernelPCA n_components #19649
Comments
A workaround can be done with SelectKBest and a custom dummy score function that ranks features in decreasing order:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline

def decreasing_order(X, y=None):
    # Dummy score function: give the first feature the highest score, so
    # SelectKBest keeps the first k columns. PCA components are already
    # sorted by decreasing explained variance, so this selects the top k.
    n_features = X.shape[1]
    return np.arange(n_features, 0, -1)

n_components = 3
pca = PCA(n_components=n_components)
other_pca = make_pipeline(PCA(),
                          SelectKBest(decreasing_order, k=n_components))

X = np.random.randn(100, 30)
y = np.random.randn(100)
np.testing.assert_almost_equal(pca.fit_transform(X, y),
                               other_pca.fit_transform(X, y))
```

Then, the pipeline can be optimized with GridSearchCV, changing selectkbest__k:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# memory='.' caches the fitted transformers, so the (parameter-free) PCA
# step is fit only once across the whole grid search.
pipeline = make_pipeline(PCA(), SelectKBest(decreasing_order, k=n_components),
                         Ridge(), memory='.')
gscv = GridSearchCV(estimator=pipeline,
                    param_grid=dict(selectkbest__k=[1, 2, 3]))
gscv.fit(X, y)
```
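The same trick should carry over to KernelPCA (also mentioned in the title), since its components are likewise sorted by decreasing eigenvalue. A minimal sketch, assuming the dense eigensolver so both fits see the same decomposition (names reuse the example above):

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline

def decreasing_order(X, y=None):
    # Same dummy score function as above: keep the first k columns.
    return np.arange(X.shape[1], 0, -1)

n_components = 3
kpca = KernelPCA(n_components=n_components, kernel="rbf",
                 eigen_solver="dense")
other_kpca = make_pipeline(
    KernelPCA(kernel="rbf", eigen_solver="dense"),  # keep all components
    SelectKBest(decreasing_order, k=n_components))

X = np.random.randn(100, 30)
np.testing.assert_almost_equal(kpca.fit_transform(X),
                               other_kpca.fit_transform(X))
```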
I actually wanted a similar thing for k-NN when optimizing the value of k: one could run the search with the maximum number of neighbours and then eliminate them one after the other, without having to refit anything.
For k-NN, a workaround is to precompute the graph with the largest number of neighbors considered, and give the precomputed graph to a subsequent estimator. For example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsTransformer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

knnr = KNeighborsRegressor(n_neighbors=5)
# Compute the 10-nearest-neighbors graph once, then let the regressor use
# only the 5 nearest neighbors from that precomputed graph.
other_knnr = make_pipeline(
    KNeighborsTransformer(n_neighbors=10),
    KNeighborsRegressor(n_neighbors=5, metric="precomputed"))

X = np.random.randn(100, 30)
y = np.random.randn(100)
np.testing.assert_almost_equal(knnr.fit(X, y).predict(X),
                               other_knnr.fit(X, y).predict(X))
```

Then, the pipeline can be optimized with GridSearchCV:

```python
from sklearn.model_selection import GridSearchCV

# memory='.' caches the fitted KNeighborsTransformer across the grid search.
pipeline = make_pipeline(
    KNeighborsTransformer(n_neighbors=10),
    KNeighborsRegressor(n_neighbors=5, metric="precomputed"),
    memory='.')
gscv = GridSearchCV(estimator=pipeline,
                    param_grid=dict(kneighborsregressor__n_neighbors=[4, 6, 8, 10]))
gscv.fit(X, y)
```
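One detail worth making explicit: the memory='.' cache is what lets GridSearchCV reuse the fitted transformer instead of recomputing the neighbors graph for every candidate, and every candidate n_neighbors must stay at or below the transformer's n_neighbors (10 here). A sketch of the same pipeline using a throwaway cache directory instead of the working directory:

```python
from tempfile import mkdtemp

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

cache_dir = mkdtemp()  # throwaway cache directory; delete it when done
pipeline = make_pipeline(
    KNeighborsTransformer(n_neighbors=10),
    KNeighborsRegressor(n_neighbors=5, metric="precomputed"),
    memory=cache_dir)
# All candidates are <= 10, the number of neighbors in the cached graph.
gscv = GridSearchCV(estimator=pipeline,
                    param_grid=dict(kneighborsregressor__n_neighbors=[4, 6, 8, 10]))
```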
This example shows a way to use cross-validation to select the best value for n_components in the PCA:
https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
However, it seems there should be a more computationally efficient way to do it: fit the PCA only once, with the maximum value of n_components to be tested, and then apply the transformation with different values of n_components.
Indeed, when using GridSearchCV to optimize n_components in the PCA, there is a new fit for each candidate value.
As PCA can be very slow, optimizing this would be cool.
Any idea on how to do it easily with the current implementations of PCA and GridSearchCV?
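For reference, a minimal sketch of the idea outside of GridSearchCV, assuming the exact svd_solver="full" so that the components are deterministic: fit once with the largest candidate value, then emulate smaller values by keeping only the first k columns (PCA components are sorted by decreasing explained variance):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(100, 30)
max_components = 10

# Fit once with the largest candidate value.
pca = PCA(n_components=max_components, svd_solver="full").fit(X)
X_max = pca.transform(X)

for k in (3, 5, 10):
    # Keeping the first k columns is equivalent to refitting with
    # n_components=k.
    np.testing.assert_almost_equal(
        X_max[:, :k],
        PCA(n_components=k, svd_solver="full").fit_transform(X))
```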