[WIP] Kernel regression #3780
Conversation
I am not sure we need this in sklearn. It won't scale. Thoughts, anyone?
```
value is left to the kernel; see the documentation for
sklearn.metrics.pairwise. Ignored by other kernels. If a sequence of
values is given, the value that minimizes the mean squared error of
leave-one-out cross-validation is selected.
```
Since when does the poly-kernel take a `gamma` parameter?
I don't know since when it takes a `gamma` parameter, but on master it does. According to its doc it computes `K(X, Y) = (gamma * <X, Y> + coef0)^degree`.
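The documented formula is easy to check numerically. A minimal numpy-only sketch (this reimplements the formula for illustration; it is not sklearn's actual `sklearn.metrics.pairwise.polynomial_kernel` code):

```python
import numpy as np

def polynomial_kernel(X, Y, degree=3, gamma=1.0, coef0=1.0):
    # K(X, Y) = (gamma * <X, Y> + coef0) ** degree, the formula quoted above
    return (gamma * X @ Y.T + coef0) ** degree

X = np.array([[1.0, 2.0]])
Y = np.array([[3.0, 4.0]])
# <x, y> = 1*3 + 2*4 = 11, so K = (0.5 * 11 + 1) ** 2 = 42.25
K = polynomial_kernel(X, Y, degree=2, gamma=0.5, coef0=1.0)
```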
Agree with @agramfort regarding scalability; this algorithm takes O(n²) time. Maybe it's something for statsmodels instead? They're more into fine-grained analysis of not-too-large datasets, I think (or at least that's what I use statsmodels for).
#3117 is related to this.
I would love to have kernel ridge regression in scikit-learn instead. Look here for inspiration: We just need to write an example (e.g., rewrite the polynomial interpolation example) and some documentation.
@mblondel what is your motivation?
Kernel ridge regression is a standard algorithm and trains faster than SVR for medium-scale datasets, as it has a closed-form solution. If I'm not mistaken, this PR implements local kernel regression, which basically does interpolation at test time and is thus much more expensive.
Other advantages include efficient multi-output support (the Cholesky decomposition needs to be computed only once) and efficient leave-one-out cross-validation.
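The closed form and the multi-output saving can be sketched in a few lines of numpy. This is an illustrative sketch of the technique being discussed, not the lightning or scikit-learn implementation; the function name and the RBF Gram matrix are my own choices:

```python
import numpy as np

def fit_kernel_ridge(K, Y, alpha=1.0):
    # Closed form: dual_coef = (K + alpha * I)^{-1} Y.
    # The Cholesky factor L is computed once and reused for every
    # output column of Y -- this is the multi-output saving.
    n = K.shape[0]
    L = np.linalg.cholesky(K + alpha * np.eye(n))
    return np.linalg.solve(L.T, np.linalg.solve(L, Y))

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                      # RBF Gram matrix (gamma=1)
Y = rng.rand(50, 3)                  # three outputs, one factorization
dual_coef = fit_kernel_ridge(K, Y, alpha=0.1)
Y_fit = K @ dual_coef                # predictions on the training points
```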
I knew you would have a good reason :)
@mblondel I would like to see kernel ridge regression (KRR) in sklearn as well. How does your code in lightning relate to PR #2165? Is this PR outdated? I would be interested in working on a PR on KRR, but let us first discuss if there is enough support for adding this to sklearn. Do you have a pointer to how efficient LOO-CV is implemented for KRR? Regarding the kernel regression, I think it falls into the same category as KNeighborsRegressor and LinearRegression: simple, well-known but prone to overfitting. Nevertheless, as baseline these methods are valuable in my opinion and practitioners might search for them in sklearn. And kernel regression should outperform KNeighborsRegressor typically. Multi-output support should also be trivial to add to kernel regression. I also don't get why computational scalability would be more critical here than in related methods. If no automatic bandwith selection is performed, fitting takes O(1) and predicting O(n*m) (n number of training samples, m number of test samples). With bandwith selection, fitting becomes O(n^2). Both could be improved potentially by exploiting that many entries of the Gram matrix are essentially 0 for not too large bandwith. KRR would also be O(n*m) during prediction (the alphas/dual_coef_ are not sparse). SVR would be better if the number of support vectors is considerably smaller than n. During fitting, KRR and SVR are certainly slower than kernel regression without bandwith selection. One thing I consider interesting in kernel regression is that it allows to perform integrated distance metric learning (compare AISTATS-07 paper cited above). This is something which is currently missing in sklearn in my opinion. |
@jmetzen In #2165, I added a
```
and the keyword arguments passed to this object as kernel_params, and
should return a floating point number.

gamma : float, default=None
```
This should rather read `float or array`, since a sequence of gamma values is also accepted.
@jmetzen I agree that kernel regression is somewhat standard, although kind of old school. I am not completely opposed to its inclusion, and I agree it could go in the
Indeed, ridge.py already contains references to an article about kernel ridge LOO errors/values. (The implementation is buggy with sample weights, but works perfectly fine without.) As can be seen from some function names, I made this very explicit in my recent refactoring efforts, e.g. here: https://github.com/scikit-learn/scikit-learn/pull/3383/files#diff-489e81d5edaa5bb1d352988699773551R202 . The generalization to kernel ridge with arbitrary kernels would be very easy if it weren't for the intercept issue, as @mblondel already said. Then there is the more philosophical question of whether kernel extensions of linear models should still be considered linear or not (they are so in the corresponding RKHS).

(I think that once again my PR to update ridge regression has stagnated into unrebaseability, because I bit off a little more than I could chew in the time I had allocated myself to do it. That said, I think I now know how to break it down, starting with a homogenization of the LOO part.)

As for the current contribution, if I am not mistaken, the general set of methods is extensively described in one of the earlier chapters of Elements of Statistical Learning (not sure about this exact one?), so it is not out of this world to consider it. However, scalability does seem to be an issue; I agree 100% with @larsmans's description of this as a method for close scrutiny of small-dimensional data. In the book, one of the next chapters is local regression, which is the first-order version if kernel interpolation is zeroth order, and generalizes (a bit) better. So yes, the neighbors module would be the place, I guess.
```python
y_pred = Ky.sum(axis=0) / K.sum(axis=0)
mse[i] = ((y_pred - self.y) ** 2).mean()

return gamma_values[np.nanargmin(mse)]
```
Are `nan` results regular when the kernel is sensibly chosen?
`nan` does not occur if the kernel's bandwidth `gamma` is chosen reasonably. Since this method optimizes the bandwidth, `nan` values might occur for too-small bandwidth values in the given range of `gamma` values.
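To illustrate where the `nan` comes from, here is a two-point sketch with a plain-numpy RBF kernel (under sklearn's convention `K = exp(-gamma * d**2)`, a large `gamma` means a small bandwidth). When the bandwidth is far too small, every leave-one-out kernel weight underflows to zero and the weighted average becomes 0/0:

```python
import numpy as np

# Two points far apart relative to the bandwidth: with gamma = 100 the
# cross-weight exp(-gamma * 10**2) underflows to exactly 0.0 in float64.
X = np.array([0.0, 10.0])
y = np.array([0.0, 1.0])
gamma = 100.0

w = np.exp(-gamma * (X[1] - X[0]) ** 2)      # underflows to 0.0
with np.errstate(invalid="ignore"):
    y_loo = (w * y[1]) / w                   # 0/0 -> nan: all LOO weights vanished
```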
According to the feedback, there is no strong desire for having kernel regression in sklearn at the moment, so I would hibernate this PR and put the code into an external repository with a link on https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets. Any opinions on this?
Regarding kernel ridge regression: I will work on an example for @mblondel's implementation in lightning.
@mblondel is +1 on adding it (see his arguments). He somehow convinced me. So
+1 for putting this code in a separate repo and adding a link.
I added the code to the repo https://github.com/jmetzen/kernel_regression

Regarding KernelRidge: I've created a gist comparing SVR and @mblondel's implementation of KernelRidge in lightning: https://gist.github.com/jmetzen/3881f09d820783c13929 As expected, SVR is faster than KR during prediction by a constant factor (due to the reduced number of support vectors). KR is faster than SVR during training for medium-sized datasets but worse for small or large datasets; the latter is somewhat unexpected.

@mblondel How should we proceed with a PR on KernelRidge? I can create an initial PR based on your code in lightning, but then the version history wouldn't reflect that you are the original author.
The choice of epsilon has a great impact on SVR: the smaller epsilon, the greater the number of support vectors and thus the slower the prediction. In particular, when epsilon=0 the solution should be completely dense.

Regarding KernelRidge, feel free to start from my code; I don't mind about the commit history. We will need a KernelRidgeCV as well, but this should be done in a separate PR. The code re-use issue between KernelRidge and Ridge (for the n_features > n_samples case) can be dealt with later (when @eickenberg's work is finished).
@mblondel @eickenberg @amueller @agramfort I started a new PR on adding a KernelRidge estimator: PR #3942.
Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection.
This simple algorithm is suited as a baseline for more complex algorithms, particularly in cases where the size of the training set is large compared to the number of dimensions. An advantage of this algorithm is that it allows efficient bandwidth selection via leave-one-out cross-validation. This PR includes an example comparing kernel regression with SVR on a simple 1D toy dataset.
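The efficient leave-one-out trick for this estimator can be sketched as follows: because "fitting" only stores the data, zeroing the diagonal of the Gram matrix removes each sample's own weight, so all n LOO predictions come from a single kernel evaluation rather than n refits. An illustrative numpy-only sketch, not the PR's actual implementation:

```python
import numpy as np

def loo_mse(X, y, gamma):
    # Zeroing the Gram matrix diagonal drops each sample's own weight,
    # turning the Nadaraya-Watson prediction for sample i into its
    # leave-one-out prediction -- one O(n^2) pass per gamma candidate.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    np.fill_diagonal(K, 0.0)
    with np.errstate(invalid="ignore"):
        y_loo = (K @ y) / K.sum(axis=1)
    return np.mean((y_loo - y) ** 2)

rng = np.random.RandomState(0)
X = rng.uniform(0.0, 5.0, size=(40, 1))
y = np.sin(X[:, 0])
gammas = [0.1, 1.0, 10.0]
best_gamma = min(gammas, key=lambda g: loo_mse(X, y, g))
```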
Potential future extensions could include metric learning for kernel regression (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.117.7080).
Any opinions on whether this is interesting for sklearn?