[WIP] Kernel regression #3780

Closed
wants to merge 3 commits into from

Conversation

jmetzen
Member

@jmetzen jmetzen commented Oct 16, 2014

Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection.

This simple algorithm is well suited as a baseline for more complex algorithms, particularly in cases where the size of the training set is large compared to the number of dimensions. An advantage of this algorithm is that it allows efficient bandwidth selection via leave-one-out cross-validation. This PR includes an example comparing kernel regression with SVR on a simple 1D toy dataset:

figure_1

figure_2

Potential future extensions of this could include metric learning for kernel regression (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.117.7080)

Any opinions on whether this is interesting for sklearn?

@agramfort
Member

I am not sure we need this in sklearn. It won't scale and it does not outperform algorithms we already have on a real problem, AFAIK.

Thoughts, anyone?

value is left to the kernel; see the documentation for
sklearn.metrics.pairwise. Ignored by other kernels. If a sequence of
values is given, the value that minimizes the leave-one-out
cross-validation mean squared error is selected.
Member

Since when does the poly-kernel take a gamma parameter?

Member Author

I don't know since when it has taken a gamma parameter, but in master it does. According to its docstring, it computes K(X, Y) = (gamma * <X, Y> + coef0)^degree.
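For reference, a minimal check against sklearn.metrics.pairwise (the values here are only illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# polynomial_kernel computes (gamma * <X, Y> + coef0) ** degree;
# when gamma is left at None it defaults to 1 / n_features.
K = polynomial_kernel(X, X, degree=2, gamma=0.5, coef0=1.0)
print(K)
```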

@larsmans
Member

Agree with @agramfort regarding scalability: this algorithm takes O(n²) time. Maybe it's something for statsmodels instead? They're more into fine-grained analysis of not-too-large datasets, I think (or at least that's what I use statsmodels for).

@arjoly
Member

arjoly commented Oct 17, 2014

#3117 is related to this.

@mblondel
Member

I would love to have kernel ridge regression in scikit-learn instead.

Look here for inspiration:
https://github.com/mblondel/lightning/blob/master/lightning/impl/kernel_ridge.py
https://github.com/mblondel/lightning/blob/master/lightning/impl/tests/test_kernel_ridge.py

Just need to write an example (e.g., rewrite the polynomial interpolation example) and some documentation.

@agramfort
Member

@mblondel what is your motivation?

@mblondel
Member

Kernel ridge regression is a standard algorithm and trains faster than SVR for medium-scale datasets as it has a closed-form solution. If I'm not mistaken, this PR implements local kernel regression, which basically does interpolation at test time and is thus much more expensive.
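For context, a minimal sketch of that closed-form solution (function and parameter names here are illustrative, not from any existing PR):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_ridge_fit_predict(X_train, y_train, X_test, alpha=1.0, gamma=0.1):
    """Closed-form kernel ridge: solve (K + alpha * I) dual_coef = y, then predict."""
    K = rbf_kernel(X_train, gamma=gamma)
    dual_coef = np.linalg.solve(K + alpha * np.eye(K.shape[0]), y_train)
    # Prediction is a kernel expansion over all training samples.
    return rbf_kernel(X_test, X_train, gamma=gamma).dot(dual_coef)
```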

@mblondel
Member

Other advantages include efficient multi-output support (the Cholesky decomposition needs to be computed only once) and efficient LOO CV.
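A hedged sketch of the multi-output point: one Cholesky factorization of (K + alpha * I) serves all output columns (names are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from sklearn.metrics.pairwise import rbf_kernel

def multi_output_dual_coef(X_train, Y_train, alpha=1.0, gamma=0.1):
    """Y_train has shape (n_samples, n_outputs); the factorization is computed once."""
    K = rbf_kernel(X_train, gamma=gamma)
    factor = cho_factor(K + alpha * np.eye(K.shape[0]))
    # One triangular solve per output column, all handled by a single cho_solve call.
    return cho_solve(factor, Y_train)
```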

@agramfort
Member

I knew you would have a good reason :)

@jmetzen
Member Author

jmetzen commented Oct 21, 2014

@mblondel I would like to see kernel ridge regression (KRR) in sklearn as well. How does your code in lightning relate to PR #2165? Is this PR outdated? I would be interested in working on a PR on KRR, but let us first discuss if there is enough support for adding this to sklearn. Do you have a pointer to how efficient LOO-CV is implemented for KRR?

Regarding the kernel regression, I think it falls into the same category as KNeighborsRegressor and LinearRegression: simple, well known, but prone to overfitting. Nevertheless, as baselines these methods are valuable in my opinion, and practitioners might search for them in sklearn. And kernel regression should typically outperform KNeighborsRegressor. Multi-output support should also be trivial to add to kernel regression.

I also don't get why computational scalability would be more critical here than in related methods. If no automatic bandwidth selection is performed, fitting takes O(1) and predicting O(n*m) (n: number of training samples, m: number of test samples). With bandwidth selection, fitting becomes O(n^2). Both could potentially be improved by exploiting the fact that many entries of the Gram matrix are essentially 0 for not-too-large bandwidths.

KRR would also be O(n*m) during prediction (the alphas/dual_coef_ are not sparse). SVR would be better if the number of support vectors is considerably smaller than n. During fitting, KRR and SVR are certainly slower than kernel regression without bandwidth selection.
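To make the prediction cost concrete, a minimal Nadaraya-Watson predictor sketch (assuming an RBF kernel; this is not the PR's actual code): the full n-by-m kernel matrix is evaluated at prediction time.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def nadaraya_watson_predict(X_train, y_train, X_test, gamma=1.0):
    """Kernel-weighted average of training targets; O(n * m) kernel evaluations per call."""
    K = rbf_kernel(X_test, X_train, gamma=gamma)  # shape (n_test, n_train)
    return (K * y_train).sum(axis=1) / K.sum(axis=1)
```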

One thing I consider interesting about kernel regression is that it allows performing integrated distance metric learning (compare the AISTATS-07 paper cited above). This is something which is currently missing in sklearn, in my opinion.

@mblondel
Member

@jmetzen In #2165, I added a kernel option to the Ridge class but most people seemed to prefer using a new KernelRidge class (which must thus be put in the top-level module). In addition, there was a difficulty regarding how to implement fit_intercept=True due to how this is currently done in the linear case. Most of the kernel stuff, including efficient LOO-CV, is already there because we implemented it for the n_features > n_samples case. So it is just a matter of exposing the code in a new interface. For pointers, see references in ridge.py. Maybe @eickenberg would like to pitch in.
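For the LOO-CV pointer: the references in ridge.py describe the usual shortcut where all leave-one-out residuals come from a single fit via the diagonal of the hat matrix. A rough sketch under that assumption (illustrative names; a dense inverse is used only for clarity):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_ridge_loo_residuals(X, y, alpha=1.0, gamma=0.1):
    """LOO residuals e_i = (y_i - y_hat_i) / (1 - H_ii) with H = K (K + alpha * I)^-1."""
    K = rbf_kernel(X, gamma=gamma)
    H = K.dot(np.linalg.inv(K + alpha * np.eye(K.shape[0])))
    y_hat = H.dot(y)
    return (y - y_hat) / (1.0 - np.diag(H))
```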

and the keyword arguments passed to this object as kernel_params, and
should return a floating point number.

gamma : float, default=None
Member

float or array

@mblondel
Member

@jmetzen I agree that kernel regression is somewhat standard, although kind of old school. I am not completely opposed to its inclusion and I agree it could go in the neighbors module. So +0 for me.

@eickenberg
Contributor

Indeed ridge.py already contains references to an article about kernel ridge LOOE/V. (The implementation is buggy with sample weights, but works perfectly fine without). As can be seen by some function names, in the n_samples < n_features case, a 'kernelized' form of ridge regression is solved, which is nothing other than kernel ridge regression with a linear kernel.

I made this very explicit in my recent refactoring efforts, e.g. here https://github.com/scikit-learn/scikit-learn/pull/3383/files#diff-489e81d5edaa5bb1d352988699773551R202 . The generalization to kernel ridge with arbitrary kernels would be very easy if it weren't for the intercept issue as @mblondel already said. Then there is the more philosophical question of whether kernel extensions of linear models should still be considered linear or not (they are so in the corresponding RKHS).

(I think that once again my PR to update ridge regression has stagnated to unrebaseability, because I bit off a little more than I could chew in the time I had allocated myself to do it. That said, I think I now know how to break it down, starting by a homogenization of the LOO part.)

As for this current contribution, if I am not mistaken, the general set of methods is extensively described in one of the earlier chapters of Elements of Statistical Learning (not sure about this exact one?), so it is not out of this world to consider it. However, scalability does seem to be an issue; I agree 100% with @larsmans' description of this as a method for close scrutiny of low-dimensional data. In the book, one of the next chapters covers local regression, which is the first-order version if kernel interpolation is the zeroth-order one, and generalizes (a bit) better. So yes, the neighbors module would be the place, I guess.

# Nadaraya-Watson prediction: kernel-weighted average of the training targets
y_pred = Ky.sum(axis=0) / K.sum(axis=0)
# mean squared error for the current candidate gamma (used for bandwidth selection)
mse[i] = ((y_pred - self.y) ** 2).mean()

# pick the gamma value with the lowest error, ignoring NaNs from degenerate bandwidths
return gamma_values[np.nanargmin(mse)]
Contributor

Are nan results regular when the kernel is sensibly chosen?

Member Author

nan does not occur if the kernel's bandwidth gamma is chosen reasonably. Since this method optimizes over the bandwidth, nan values might occur for bandwidth values in the given gamma range that are too small.
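A small illustration of that failure mode, assuming an RBF kernel: for an extremely large gamma (i.e. an extremely small bandwidth), every weight underflows to zero and the weighted average becomes 0/0:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])

K = rbf_kernel(np.array([[0.5]]), X, gamma=1e6)  # all weights underflow to 0
print((K * y).sum(axis=1) / K.sum(axis=1))       # -> [nan] from the 0/0 division
```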

@jmetzen
Member Author

jmetzen commented Oct 26, 2014

Based on the feedback, there is no strong desire to have kernel regression in sklearn at the moment, so I would hibernate this PR and put the code into an external repository, with a link on https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets.

Any opinions on this?

@jmetzen
Member Author

jmetzen commented Oct 26, 2014

Regarding kernel ridge regression: I will work on an example for @mblondel's implementation in lightning.

@agramfort
Member

@mblondel is +1 on adding it (see his arguments). He has somewhat convinced me. So +0 on my side.

@amueller
Member

amueller commented Nov 6, 2014

+1 for putting this code in a separate repo and adding a link
+1 for adding a KernelRidge estimator ;)

@jmetzen
Member Author

jmetzen commented Nov 24, 2014

I added the code to the repo https://github.com/jmetzen/kernel_regression

Regarding KernelRidge: I've created a gist comparing SVR and @mblondel's implementation of KernelRidge in lightning: https://gist.github.com/jmetzen/3881f09d820783c13929
The most interesting result is probably the comparison of computation time for training and testing:

figure_2

As expected, SVR is faster than KR during prediction by a constant factor (due to the reduced number of support vectors). KR is faster than SVR during training on medium-sized datasets but slower on small or large ones. The latter is somewhat unexpected.

@mblondel How should we proceed with a PR on KernelRidge? I can create an initial PR based on your code in lightning but then the version history wouldn't reflect that you are the original author.

@mblondel
Member

The choice of epsilon has a great impact on SVR. The smaller epsilon is, the greater the number of support vectors and thus the slower the prediction. In particular, when epsilon=0 the solution should be completely dense.
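A quick way to see that effect with scikit-learn's SVR (toy data and parameters are only illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

for eps in (0.5, 0.1, 0.0):
    n_sv = SVR(kernel="rbf", epsilon=eps).fit(X, y).support_.shape[0]
    # smaller epsilon -> more support vectors -> slower prediction
    print("epsilon=%g: %d support vectors" % (eps, n_sv))
```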

Regarding KernelRidge, feel free to start from my code. I don't mind about the commit history. We will need a KernelRidgeCV as well, but this should be done in a separate PR. The code-reuse issue between KernelRidge and Ridge (for the n_features > n_samples case) can be dealt with later (when @eickenberg's work is finished).

@jmetzen
Member Author

jmetzen commented Dec 7, 2014

@mblondel @eickenberg @amueller @agramfort I started a new PR on adding a KernelRidge estimator: PR #3942.

@jmetzen jmetzen closed this Dec 7, 2014