[WIP] Kernel regression #3780

Closed
wants to merge 3 commits into from

Conversation

jmetzen
Member

@jmetzen jmetzen commented Oct 16, 2014

Implementation of Nadaraya-Watson kernel regression with automatic bandwidth selection.

This simple algorithm is well suited as a baseline for more complex algorithms, particularly in cases where the size of the training set is large compared to the number of dimensions. An advantage of this algorithm is that it allows efficient bandwidth selection via leave-one-out cross-validation. This PR includes an example comparing kernel regression with SVR on a simple 1D toy dataset:

figure_1

figure_2

Potential future extensions of this could include metric learning for kernel regression (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.117.7080)

Any opinions on whether this is interesting for sklearn?

@agramfort
Member

I am not sure we need this in sklearn. It won't scale and it does not outperform algorithms we already have on a real problem, AFAIK.

Thoughts, anyone?

value is left to the kernel; see the documentation for
sklearn.metrics.pairwise. Ignored by other kernels. If a sequence of
values is given, the value that minimizes the leave-one-out
cross-validation mean squared error is selected.
Member

Since when does the poly-kernel take a gamma parameter?

Member Author

I don't know since when it has taken a gamma parameter, but in master it does. According to its docstring, it computes K(X, Y) = (gamma * <X, Y> + coef0)^degree.
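For reference, a minimal check against sklearn.metrics.pairwise (the values here are only illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

X = np.array([[1.0, 2.0], [3.0, 4.0]])

# polynomial_kernel computes (gamma * <X, Y> + coef0) ** degree;
# when gamma is left at None it defaults to 1 / n_features.
K = polynomial_kernel(X, X, degree=2, gamma=0.5, coef0=1.0)
print(K)
```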

@larsmans
Member

Agree with @agramfort regarding scalability: this algorithm takes O(n²) time. Maybe it's something for statsmodels instead? They're more into fine-grained analysis of not-too-large datasets, I think (or at least that's what I use statsmodels for).

@arjoly
Member

arjoly commented Oct 17, 2014

#3117 is related to this.

@mblondel
Member

I would love to have kernel ridge regression in scikit-learn instead.

Look here for inspiration:
https://github.com/mblondel/lightning/blob/master/lightning/impl/kernel_ridge.py
https://github.com/mblondel/lightning/blob/master/lightning/impl/tests/test_kernel_ridge.py

Just need to write an example (e.g., rewrite the polynomial interpolation example) and some documentation.

@agramfort
Member

@mblondel what is your motivation?

@mblondel
Member

Kernel ridge regression is a standard algorithm and trains faster than SVR for medium-scale datasets as it has a closed-form solution. If I'm not mistaken, this PR implements local kernel regression, which basically does interpolation at test time and is thus much more expensive.
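For context, a minimal sketch of that closed-form solution (function and parameter names here are illustrative, not from any existing PR):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_ridge_fit_predict(X_train, y_train, X_test, alpha=1.0, gamma=0.1):
    """Closed-form kernel ridge: solve (K + alpha * I) dual_coef = y, then predict."""
    K = rbf_kernel(X_train, gamma=gamma)
    dual_coef = np.linalg.solve(K + alpha * np.eye(K.shape[0]), y_train)
    # Prediction is a kernel expansion over all training samples.
    return rbf_kernel(X_test, X_train, gamma=gamma).dot(dual_coef)
```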

@mblondel
Member

Other advantages include efficient multi-output support (the Cholesky decomposition needs to be computed only once) and efficient LOO CV.
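A hedged sketch of the multi-output point: one Cholesky factorization of (K + alpha * I) serves all output columns (names are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from sklearn.metrics.pairwise import rbf_kernel

def multi_output_dual_coef(X_train, Y_train, alpha=1.0, gamma=0.1):
    """Y_train has shape (n_samples, n_outputs); the factorization is computed once."""
    K = rbf_kernel(X_train, gamma=gamma)
    factor = cho_factor(K + alpha * np.eye(K.shape[0]))
    # One triangular solve per output column, all handled by a single cho_solve call.
    return cho_solve(factor, Y_train)
```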

@agramfort
Member

I knew you would have a good reason :)

@jmetzen
Member Author

jmetzen commented Oct 21, 2014

@mblondel I would like to see kernel ridge regression (KRR) in sklearn as well. How does your code in lightning relate to PR #2165? Is this PR outdated? I would be interested in working on a PR on KRR, but let us first discuss if there is enough support for adding this to sklearn. Do you have a pointer to how efficient LOO-CV is implemented for KRR?

Regarding the kernel regression, I think it falls into the same category as KNeighborsRegressor and LinearRegression: simple, well known, but prone to overfitting. Nevertheless, as baselines these methods are valuable in my opinion, and practitioners might search for them in sklearn. And kernel regression should typically outperform KNeighborsRegressor. Multi-output support should also be trivial to add to kernel regression.

I also don't get why computational scalability would be more critical here than in related methods. If no automatic bandwidth selection is performed, fitting takes O(1) and predicting O(n*m) (n: number of training samples, m: number of test samples). With bandwidth selection, fitting becomes O(n^2). Both could potentially be improved by exploiting the fact that many entries of the Gram matrix are essentially 0 for not-too-large bandwidths.

KRR would also be O(n*m) during prediction (the alphas/dual_coef_ are not sparse). SVR would be better if the number of support vectors is considerably smaller than n. During fitting, KRR and SVR are certainly slower than kernel regression without bandwidth selection.
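To make the prediction cost concrete, a minimal Nadaraya-Watson predictor sketch (assuming an RBF kernel; this is not the PR's actual code): the full n-by-m kernel matrix is evaluated at prediction time.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def nadaraya_watson_predict(X_train, y_train, X_test, gamma=1.0):
    """Kernel-weighted average of training targets; O(n * m) kernel evaluations per call."""
    K = rbf_kernel(X_test, X_train, gamma=gamma)  # shape (n_test, n_train)
    return (K * y_train).sum(axis=1) / K.sum(axis=1)
```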

One thing I consider interesting about kernel regression is that it allows performing integrated distance metric learning (compare the AISTATS-07 paper cited above). This is something which is currently missing in sklearn, in my opinion.

@mblondel
Member

@jmetzen In #2165, I added a kernel option to the Ridge class but most people seemed to prefer using a new KernelRidge class (which must thus be put in the top-level module). In addition, there was a difficulty regarding how to implement fit_intercept=True due to how this is currently done in the linear case. Most of the kernel stuff, including efficient LOO-CV, is already there because we implemented it for the n_features > n_samples case. So it is just a matter of exposing the code in a new interface. For pointers, see references in ridge.py. Maybe @eickenberg would like to pitch in.
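For the LOO-CV pointer: the references in ridge.py describe the usual shortcut where all leave-one-out residuals come from a single fit via the diagonal of the hat matrix. A rough sketch under that assumption (illustrative names; a dense inverse is used only for clarity):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_ridge_loo_residuals(X, y, alpha=1.0, gamma=0.1):
    """LOO residuals e_i = (y_i - y_hat_i) / (1 - H_ii) with H = K (K + alpha * I)^-1."""
    K = rbf_kernel(X, gamma=gamma)
    H = K.dot(np.linalg.inv(K + alpha * np.eye(K.shape[0])))
    y_hat = H.dot(y)
    return (y - y_hat) / (1.0 - np.diag(H))
```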

and the keyword arguments passed to this object as kernel_params, and
should return a floating point number.

gamma : float, default=None
Member

float or array

@mblondel
Member

@jmetzen I agree that kernel regression is somewhat standard, although kind of old school. I am not completely opposed to its inclusion and I agree it could go in the neighbors module. So +0 for me.

@eickenberg
Contributor

Indeed ridge.py already contains references to an article about kernel ridge LOOE/V. (The implementation is buggy with sample weights, but works perfectly fine without). As can be seen by some function names, in the n_samples < n_features case, a 'kernelized' form of ridge regression is solved, which is nothing other than kernel ridge regression with a linear kernel.

I made this very explicit in my recent refactoring efforts, e.g. here https://github.com/scikit-learn/scikit-learn/pull/3383/files#diff-489e81d5edaa5bb1d352988699773551R202 . The generalization to kernel ridge with arbitrary kernels would be very easy if it weren't for the intercept issue as @mblondel already said. Then there is the more philosophical question of whether kernel extensions of linear models should still be considered linear or not (they are so in the corresponding RKHS).

(I think that once again my PR to update ridge regression has stagnated to unrebaseability, because I bit off a little more than I could chew in the time I had allocated myself to do it. That said, I think I now know how to break it down, starting by a homogenization of the LOO part.)

As for this current contribution, if I am not mistaken, the general set of methods is extensively described in one of the earlier chapters of Elements of Statistical Learning (not sure about this exact one?), so it is not out of this world to consider it. However, scalability does seem to be an issue; I agree 100% with @larsmans' description of this as a method for close scrutiny of low-dimensional data. In the book, one of the next chapters covers local regression, which is the first-order version if kernel interpolation is the zeroth-order one, and generalizes (a bit) better. So yes, the neighbors module would be the place, I guess.

# Nadaraya-Watson prediction: kernel-weighted average of the training targets
y_pred = Ky.sum(axis=0) / K.sum(axis=0)
# mean squared error for the current candidate gamma (used for bandwidth selection)
mse[i] = ((y_pred - self.y) ** 2).mean()

# pick the gamma value with the lowest error, ignoring NaNs from degenerate bandwidths
return gamma_values[np.nanargmin(mse)]
Contributor

Are nan results regular when the kernel is sensibly chosen?

Member Author

nan does not occur if the kernel's bandwidth gamma is chosen reasonably. Since this method optimizes over the bandwidth, nan values might occur for bandwidth values in the given gamma range that are too small.
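A small illustration of that failure mode, assuming an RBF kernel: for an extremely large gamma (i.e. an extremely small bandwidth), every weight underflows to zero and the weighted average becomes 0/0:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])

K = rbf_kernel(np.array([[0.5]]), X, gamma=1e6)  # all weights underflow to 0
print((K * y).sum(axis=1) / K.sum(axis=1))       # -> [nan] from the 0/0 division
```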

@jmetzen
Member Author

jmetzen commented Oct 26, 2014

Based on the feedback, there is no strong desire to have kernel regression in sklearn at the moment, so I would hibernate this PR and put the code into an external repository, with a link on https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets.

Any opinions on this?

@jmetzen
Member Author

jmetzen commented Oct 26, 2014

Regarding kernel ridge regression: I will work on an example for @mblondel's implementation in lightning.

@agramfort
Member

@mblondel is +1 on adding it (see his arguments). He has somewhat convinced me. So +0 on my side.

@amueller
Member

amueller commented Nov 6, 2014

+1 for putting this code in a separate repo and adding a link
+1 for adding a KernelRidge estimator ;)

@jmetzen
Member Author

jmetzen commented Nov 24, 2014

I added the code to the repo https://github.com/jmetzen/kernel_regression

Regarding KernelRidge: I've created a gist comparing SVR and @mblondel's implementation of KernelRidge in lightning: https://gist.github.com/jmetzen/3881f09d820783c13929
The most interesting result is probably the comparison of computation time for training and testing:

figure_2

As expected, SVR is faster than KR during prediction by a constant factor (due to the reduced number of support vectors). KR is faster than SVR during training on medium-sized datasets but slower on small or large ones. The latter is somewhat unexpected.

@mblondel How should we proceed with a PR on KernelRidge? I can create an initial PR based on your code in lightning but then the version history wouldn't reflect that you are the original author.

@mblondel
Member

The choice of epsilon has a great impact on SVR. The smaller epsilon is, the greater the number of support vectors and thus the slower the prediction. In particular, when epsilon=0 the solution should be completely dense.
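A quick way to see that effect with scikit-learn's SVR (toy data and parameters are only illustrative):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

for eps in (0.5, 0.1, 0.0):
    n_sv = SVR(kernel="rbf", epsilon=eps).fit(X, y).support_.shape[0]
    # smaller epsilon -> more support vectors -> slower prediction
    print("epsilon=%g: %d support vectors" % (eps, n_sv))
```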

Regarding KernelRidge, feel free to start from my code. I don't mind about the commit history. We will need a KernelRidgeCV as well, but this should be done in a separate PR. The code-reuse issue between KernelRidge and Ridge (for the n_features > n_samples case) can be dealt with later (when @eickenberg's work is finished).

@jmetzen
Member Author

jmetzen commented Dec 7, 2014

@mblondel @eickenberg @amueller @agramfort I started a new PR on adding a KernelRidge estimator: PR #3942.

@jmetzen jmetzen closed this Dec 7, 2014