[MRG] Added kernel weighting functions for neighbors classes #3117

Open

nmayorov wants to merge 27 commits into scikit-learn:master from nmayorov:neighbors_kernels

7 participants

@nmayorov

This patch enables the use of kernel functions for neighbors weighting.

It adds the following keywords for the weights argument: 'tophat', 'gaussian', 'epanechnikov', 'exponential', 'linear', 'cosine', i.e. all kernels available in the KernelDensity class.

For KNeighborsClassifier and KNeighborsRegressor the kernel bandwidth is equal to the distance to the (k+1)-th nearest neighbor (i.e. it depends on the query point).

For RadiusNeighborsClassifier and RadiusNeighborsRegressor the kernel bandwidth is equal to the radius parameter of the estimator (i.e. it is constant).
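
Usage would look like this (a quick sketch of the intended API on this branch; the only change from current code is the value passed as weights):

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    iris = load_iris()
    X, y = iris.data, iris.target

    # kernel name passed through the existing `weights` parameter;
    # the bandwidth is chosen automatically as described above
    clf = KNeighborsClassifier(n_neighbors=5, weights='epanechnikov')
    clf.fit(X, y)
    print(clf.predict(X[:3]))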

Please take a look.

@coveralls


Coverage remained the same when pulling 9d3f813 on nmayorov:neighbors_kernels into 6945d5b on scikit-learn:master.

Nikolay Mayorov added some commits Apr 29, 2014
@nmayorov nmayorov changed the title from Added kernel weighting functions for neighbors classes to [WIP] Added kernel weighting functions for neighbors classes Oct 7, 2014
@nmayorov

Hello! This has been sitting here for months with no attention, unfortunately.

Let me try to explain the intention of this PR in more detail.

Currently there are two options for weighting neighbor predictions: 'uniform' (majority vote) and 'distance', which uses 1/dist weights. The first one is classic; the second one is quite controversial (the infinities that can occur are not fun to deal with, and I'm not sure it's a good option, to be honest).

There is also a probabilistic interpretation of neighbor methods, which manifests itself in sklearn.neighbors.KernelDensity. We can use it for prediction in kNN as well: estimate the density for each class at a query point and then pick the class with the highest value (a Bayesian approach). This can be done very easily by using kernels (as in kernel density estimation) as weighting functions.
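
To make the idea concrete with what already exists in scikit-learn (just an illustration, not part of this PR): fit one KernelDensity per class and pick the class with the highest prior-weighted density at the query points.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.neighbors import KernelDensity

    iris = load_iris()
    X, y = iris.data, iris.target
    classes = np.unique(y)

    # one density estimate per class
    kdes = [KernelDensity(kernel='epanechnikov', bandwidth=0.5).fit(X[y == c])
            for c in classes]

    def predict(X_query):
        # score_samples returns the log-density; add the log prior
        # for a Bayesian decision
        log_post = [kde.score_samples(X_query) + np.log(np.mean(y == c))
                    for c, kde in zip(classes, kdes)]
        return classes[np.argmax(log_post, axis=0)]

    print(predict(X[:5]))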

One subtle point is that some kernel functions (like gaussian) are non-zero on an infinite interval, so in kNN prediction we have to use their "truncated" versions. But I don't think it matters much in practice. As far as the selected kernel bandwidth is concerned, please refer to my opening message.

Other neighbor weighting strategies also exist which aren't directly associated with kernel density estimation. Potentially we could incorporate them into sklearn.neighbors as well. Overall I think there should be more options besides 'uniform' and 'distance'.


Please tell me whether you think this is useful or not. I'm willing to finish this PR properly (add narrative docs and so on).

Ping @agramfort @jakevdp @larsman. Anyone?

@agramfort
scikit-learn member

can you provide some benchmark results that demonstrate the usefulness of this on a public dataset, in terms of accuracy and computation time?

thanks

@nmayorov

Hi, Alexandre.

I created an IPython notebook where I test the different weights on a well-known dataset. Take a look: http://nbviewer.ipython.org/gist/nmayorov/9b11161f9b66df12d2b9.

@agramfort
scikit-learn member

ok good. Can you comment on the extra computation time, if it is significant at all?
How long is the test time?
You'll need to add a paragraph to the narrative docs explaining the kernels
and why one might want to use them.

@nmayorov

It does not require any significant extra time; it's simply a matter of evaluating a different weighting function (just as with 1/dist). I added benchmarks to the IPython notebook.

Also, I remembered one thing: this technique for regression is known as the Nadaraya-Watson estimator. And in fact there is a whole chapter about similar methods in "The Elements of Statistical Learning" (for example, check out Figure 6.1 there, which is pretty illustrative).

With a compact kernel (non-zero only within the bandwidth) we can do this regression locally, using only a small number of neighbors. Perhaps we should keep only kernels that are non-zero within the bandwidth range, for theoretical integrity. What do you think?
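
By the way, something close to this can already be emulated with the existing callable weights option and a fixed bandwidth (a rough sketch; the bandwidth of 0.5 is an arbitrary choice here, whereas this PR would pick it adaptively per query):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    def epanechnikov_weights(dist, bandwidth=0.5):
        # truncated Epanechnikov kernel: zero beyond the bandwidth
        u = dist / bandwidth
        return np.clip(1.0 - u ** 2, 0.0, None)

    rng = np.random.RandomState(0)
    X = np.sort(5 * rng.rand(40, 1), axis=0)
    y = np.sin(X).ravel() + 0.1 * rng.randn(40)

    # Nadaraya-Watson style estimate: kernel-weighted average of the
    # targets of the k nearest neighbors
    reg = KNeighborsRegressor(n_neighbors=10, weights=epanechnikov_weights)
    reg.fit(X, y)
    print(reg.predict([[2.5]]))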

About the narrative docs: I think I'll just mention that this can be interpreted as KDE for classification and Nadaraya-Watson estimation for regression, but I won't go deep into that (I also don't think I can). After all, these are just a few new reasonable weighting functions which give more credit to closer neighbors.

Give me some feedback.

@arjoly
scikit-learn member

ping @jakevdp You might want to have a look at this PR.

@agramfort agramfort and 1 other commented on an outdated diff Oct 14, 2014
sklearn/neighbors/base.py
"""Get the weights from an array of distances and a parameter ``weights``
Parameters
===========
dist: ndarray
The input distances
- weights: {'uniform', 'distance' or a callable}
+ weights: None, string from VALID_WEIGHTS or callable
@agramfort
scikit-learn member
agramfort added a line comment Oct 14, 2014

I would write

weights: None, str or callable
    The kind of weighting used. The valid string parameters are
    'uniform', 'distance', 'tophat', etc....

the mathematical formulas of the different kernels should be in the narrative
doc, ideally with a plot of the kernel shapes.
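
Something like this could generate that plot (a sketch; unnormalized kernel shapes with bandwidth 1, following the forms listed in the KernelDensity documentation):

    import numpy as np
    import matplotlib.pyplot as plt

    u = np.linspace(0, 1.5, 200)  # distance / bandwidth
    kernels = {
        'tophat': (u < 1).astype(float),
        'gaussian': np.exp(-0.5 * u ** 2),
        'epanechnikov': np.clip(1 - u ** 2, 0, None),
        'exponential': np.exp(-u),
        'linear': np.clip(1 - u, 0, None),
        'cosine': np.where(u < 1, np.cos(0.5 * np.pi * u), 0),
    }

    for name, k in kernels.items():
        plt.plot(u, k, label=name)
    plt.xlabel('distance / bandwidth')
    plt.ylabel('weight (unnormalized)')
    plt.legend()
    plt.show()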

@nmayorov
nmayorov added a line comment Oct 14, 2014

It is a private function of the module, so I thought I could be more technically explicit and mention VALID_WEIGHTS. (Does that make sense?)

@agramfort
scikit-learn member
agramfort added a line comment Oct 15, 2014

it was explicit and clear before; please keep it clear and explicit.

@nmayorov
nmayorov added a line comment Oct 15, 2014

Got you.

@agramfort agramfort and 1 other commented on an outdated diff Oct 14, 2014
sklearn/neighbors/classification.py
@@ -187,7 +198,16 @@ def predict_proba(self, X):
"""
X = atleast2d_or_csr(X)
- neigh_dist, neigh_ind = self.kneighbors(X)
+ if self.weights in KERNEL_WEIGHTS:
+ neigh_dist, neigh_ind = \
+ self.kneighbors(X, n_neighbors=self.n_neighbors + 1)
+ bandwidth = neigh_dist[:, -1]
+ neigh_dist, neigh_ind = neigh_dist[:, :-1], neigh_ind[:, :-1]
+ weights = _get_weights(neigh_dist, self.weights,
+ bandwidth=bandwidth)
@agramfort
scikit-learn member
agramfort added a line comment Oct 14, 2014

this block of lines seems to be duplicated a few times. Why not update _get_weights to avoid these duplications?

@nmayorov
nmayorov added a line comment Oct 14, 2014

Duplications are not good, but I don't see a better way here.

For bandwidth in _get_weights we can pass an array (for KNeighbors) or a single value (for RadiusNeighbors) — the usage is case-dependent. There is also a small piece of logic for discarding the last neigh_dist / neigh_ind column.

I could modify KNeighborsMixin.kneighbors() so that it returns neigh_dist, neigh_ind and bandwidth (depending on self.weights), but that seems bad.

Refactoring this into a new helper function in base also seems excessive.

@agramfort
scikit-learn member
agramfort added a line comment Oct 15, 2014

you can change the API of _get_weights to avoid the duplication.

@nmayorov
nmayorov added a line comment Oct 15, 2014

OK, I'll look into that, but I'm still not sure.
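
Maybe the whole block could go into one small helper, something like this (just a sketch, untested; the name is hypothetical):

    def _neighbors_and_weights(self, X):
        # when a kernel is used, fetch one extra neighbor: its distance
        # serves as the per-query bandwidth
        if self.weights in KERNEL_WEIGHTS:
            dist, ind = self.kneighbors(X, n_neighbors=self.n_neighbors + 1)
            weights = _get_weights(dist[:, :-1], self.weights,
                                   bandwidth=dist[:, -1])
            return dist[:, :-1], ind[:, :-1], weights
        dist, ind = self.kneighbors(X)
        return dist, ind, _get_weights(dist, self.weights)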

@agramfort
scikit-learn member

you need to add a paragraph to the narrative doc and update an example to showcase this feature.

@agramfort agramfort and 2 others commented on an outdated diff Oct 15, 2014
sklearn/neighbors/base.py
return dist
elif callable(weights):
return weights(dist)
else:
- raise ValueError("weights not recognized: should be 'uniform', "
- "'distance', or a callable function")
@agramfort
scikit-learn member
agramfort added a line comment Oct 15, 2014

there was a nice error message; now it's gone... please step into the shoes of a user who messes up the name of the kernel

@nmayorov
nmayorov added a line comment Oct 15, 2014

It is checked in _check_weights; previously there was a duplication. The error message will appear all right.

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a line comment Oct 21, 2014

Good job redesigning this!

Nikolay Mayorov Formatting tiny fix 61efb39
@arjoly arjoly referenced this pull request Oct 17, 2014
Closed

[WIP] Kernel regression #3780

@nmayorov

I've done some work, please review.

@agramfort agramfort and 1 other commented on an outdated diff Oct 18, 2014
examples/neighbors/plot_regression.py
@@ -34,16 +34,16 @@
# Fit regression model
n_neighbors = 5
-for i, weights in enumerate(['uniform', 'distance']):
+plt.figure(figsize=(8, 9))
+for i, weights in enumerate(['uniform', 'distance', 'epanechnikov']):
@agramfort
scikit-learn member
agramfort added a line comment Oct 18, 2014

running this example, it seems that epanechnikov is the one kernel in this list that does not force the line to go through the training points.

in terms of user understanding this point should be explained in the doc.

[figure: plot_regression.py output comparing 'uniform', 'distance' and 'epanechnikov' weights]

@nmayorov
nmayorov added a line comment Oct 18, 2014

I wouldn't say that this is a characteristic property of smoothing kernels. The estimate with smoothing kernels is less bumpy and smoother than with uniform weights, and that's all. (Why do you imply that 'uniform' forces the line to go through the training points?)

Only 'distance' shows this weird property that the line has to pass through every training point. (Which is, again, an argument not to use it at all.)

@agramfort
scikit-learn member

@dsullivan7 can you review this and help with the English wording? thanks a lot

@dsullivan7

Yes, will do in a bit

@dsullivan7 dsullivan7 commented on an outdated diff Oct 20, 2014
doc/modules/neighbors.rst
``weights = 'distance'`` assigns weights proportional to the inverse of the
-distance from the query point. Alternatively, a user-defined function of the
-distance can be supplied which is used to compute the weights.
-
+distance from the query point. Other allowed values for ``weights`` are names
+of :ref:`kernels <kernels>`: ``'tophat'``, ``'gaussian'``, ``'epanechnikov'``,
@dsullivan7
dsullivan7 added a line comment Oct 20, 2014

take out "names of"

@dsullivan7 dsullivan7 and 1 other commented on an outdated diff Oct 20, 2014
doc/modules/neighbors.rst
``weights = 'distance'`` assigns weights proportional to the inverse of the
-distance from the query point. Alternatively, a user-defined function of the
-distance can be supplied which is used to compute the weights.
-
+distance from the query point. Other allowed values for ``weights`` are names
+of :ref:`kernels <kernels>`: ``'tophat'``, ``'gaussian'``, ``'epanechnikov'``,
+``'exponential'``, ``'linear'``, ``'cosine'``. The bandwidth of a kernel is
+equal to the distance to :math:`k + 1` neighbor for
+:class:`KNeighborsClassifier` and to the radius :math:`r` for
+:class:`RadiusNeighborsClassifier`. The sum of weighted by a kernel votes for
@dsullivan7
dsullivan7 added a line comment Oct 20, 2014

Is the bandwidth of a kernel equal to a single distance for KNeighborsClassifier? Like, if I choose k=5, is the bandwidth equal to the distance to the 6th closest? If that's the case it should be "equal to the distance to the :math:`k + 1` neighbor"

@nmayorov
nmayorov added a line comment Oct 20, 2014

Yes, the bandwidth is equal to a single distance. I will add "the" before :math:k + 1 neighbor.

@dsullivan7 dsullivan7 and 1 other commented on an outdated diff Oct 20, 2014
doc/modules/neighbors.rst
``weights = 'distance'`` assigns weights proportional to the inverse of the
-distance from the query point. Alternatively, a user-defined function of the
-distance can be supplied which is used to compute the weights.
-
+distance from the query point. Other allowed values for ``weights`` are names
+of :ref:`kernels <kernels>`: ``'tophat'``, ``'gaussian'``, ``'epanechnikov'``,
+``'exponential'``, ``'linear'``, ``'cosine'``. The bandwidth of a kernel is
+equal to the distance to :math:`k + 1` neighbor for
+:class:`KNeighborsClassifier` and to the radius :math:`r` for
+:class:`RadiusNeighborsClassifier`. The sum of weighted by a kernel votes for
+a class is proportional to the probability density for this class estimated
@dsullivan7
dsullivan7 added a line comment Oct 20, 2014

I believe this should be "The sum of weighted kernel votes for a class" (remove "by a")

@nmayorov
nmayorov added a line comment Oct 20, 2014

Is "kernel votes" some kind of special term here? Maybe you are right, but it doesn't sound clear to me. We have a set of votes for every class, then we multiply them by values of a kernel function at positions of voting neighbors (so "weight by a kernel".) Do you still think it should be changed?

@dsullivan7
dsullivan7 added a line comment Oct 21, 2014

Ah, I see: weighted-by-a-kernel votes. Hmmm. Perhaps "kernel-weighted votes"?

@dsullivan7 dsullivan7 commented on an outdated diff Oct 20, 2014
doc/modules/neighbors.rst
@@ -224,8 +235,15 @@ to the regression than faraway points. This can be accomplished through
the ``weights`` keyword. The default value, ``weights = 'uniform'``,
assigns equal weights to all points. ``weights = 'distance'`` assigns
weights proportional to the inverse of the distance from the query point.
-Alternatively, a user-defined function of the distance can be supplied,
-which will be used to compute the weights.
+Other allowed values for ``weights`` are names of :ref:`kernels <kernels>`:
+``'tophat'``, ``'gaussian'``, ``'epanechnikov'``, ``'exponential'``,
+``'linear'``, ``'cosine'``. The bandwidth of a kernel is equal to the distance
+to :math:`k + 1` neighbor for :class:`KNeighborsRegressor` and to the radius
+:math:`r` for :class:`RadiusNeighborsRegressor`. Using kernels for nearest
+neighbor regression results in smoother fitted function, which is often
@dsullivan7
dsullivan7 added a line comment Oct 20, 2014

"results is a smoother fitted function"

@dsullivan7 dsullivan7 commented on an outdated diff Oct 20, 2014
sklearn/neighbors/base.py
"""Get the weights from an array of distances and a parameter ``weights``
Parameters
===========
dist: ndarray
The input distances
- weights: {'uniform', 'distance' or a callable}
- The kind of weighting used
+ weights: None, callable or string
+ The kind of weighting used. The valid string parameters are 'uniform',
+ 'distance', 'tophat', 'gaussian', 'epanechnikov', 'exponential',
+ 'linear', 'cosine'.
+ bandwidth: float or ndarray
+ The kernel function bandwidth (only for kernel weighting).
+ If float, then the bandwidth is the same for all queries.
+ If ndarray, then i-th element is used as a bandwidth for
+ i-th query.
+
@dsullivan7
dsullivan7 added a line comment Oct 20, 2014

"then the i-th element is used as a bandwidth for the i-th query"

@nmayorov

@dsullivan7 thanks very much for your input.

@agramfort agramfort commented on an outdated diff Oct 21, 2014
sklearn/neighbors/tests/test_neighbors.py
@@ -1,3 +1,5 @@
+'gaussian', 'epanechnikov', 'exponential', 'linear', 'cosine'
@agramfort
scikit-learn member
agramfort added a line comment Oct 21, 2014

?

@agramfort agramfort commented on an outdated diff Oct 21, 2014
sklearn/neighbors/tests/test_neighbors.py
@@ -567,7 +580,8 @@ def test_RadiusNeighborsRegressor_multioutput(n_samples=40,
y = np.vstack([y, y]).T
y_target = y[:n_test_pts]
- weights = ['uniform', 'distance', _weight_func]
+ weights = [None, 'uniform', 'distance', _weight_func, 'gaussian',
+ 'epanechnikov', 'exponential', 'linear', 'cosine']
@agramfort
scikit-learn member
agramfort added a line comment Oct 21, 2014

just define a global WEIGHTS variable and avoid this duplication of

[None, 'uniform', 'distance', _weight_func, 'gaussian', 'epanechnikov', 'exponential', 'linear', 'cosine']

@agramfort agramfort commented on an outdated diff Oct 21, 2014
sklearn/neighbors/base.py
@@ -403,6 +429,36 @@ def kneighbors_graph(self, X, n_neighbors=None,
return csr_matrix((A_data.ravel(), A_ind.ravel(), A_indptr),
shape=(n_samples1, n_samples2))
+ def _get_neighbors_and_weights(self, X):
+ """Find neighbors to X and assign weights to them according to
+ class parameters.
+
+ Parameters
+ ----------
+ X : array of shape [n_samples, n_features]
+ A 2-D array representing the set of points.
+
+ Returns
+ -------
+ dist : array of shape [n_samples, self.n_neighbors]
@agramfort
scikit-learn member
agramfort added a line comment Oct 21, 2014

docstring formatting should be:

dist : array, shape (n_samples, self.n_neighbors)

please check all docstrings.

@agramfort agramfort and 2 others commented on an outdated diff Oct 21, 2014
doc/modules/neighbors.rst
@@ -224,8 +235,15 @@ to the regression than faraway points. This can be accomplished through
the ``weights`` keyword. The default value, ``weights = 'uniform'``,
assigns equal weights to all points. ``weights = 'distance'`` assigns
weights proportional to the inverse of the distance from the query point.
-Alternatively, a user-defined function of the distance can be supplied,
-which will be used to compute the weights.
+Other allowed values for ``weights`` are names of :ref:`kernels <kernels>`:
+``'tophat'``, ``'gaussian'``, ``'epanechnikov'``, ``'exponential'``,
+``'linear'``, ``'cosine'``. The bandwidth of a kernel is equal to the distance
+to :math:`k + 1` neighbor for :class:`KNeighborsRegressor` and to the radius
+:math:`r` for :class:`RadiusNeighborsRegressor`. Using kernels for nearest
+neighbor regression results in smoother fitted function, which is often
+desirable. Alternatively, a user-defined function of the distance can be
+supplied, which will be used to compute the weights.
@agramfort
scikit-learn member
agramfort added a line comment Oct 21, 2014

you should explain in the doc why one might want to use a fancy kernel.
the doc is meant to provide guidelines, not just be descriptive.

@nmayorov
nmayorov added a line comment Oct 21, 2014

Well, I don't see any discussion of the different kernels in the kernel density estimation section. And, as another example, the section about SVMs says nothing about how to choose a (feature mapping) kernel.

I mean that it is often hard to give general recommendations in machine learning. I would say that all kernels (except 'tophat') are very similar; on a particular dataset one or another might work better.

Why do we need different kernels in kNN, then? I give two arguments:

  1. Kernels perform differently depending on the data, and we don't know a priori which one will be the best. On the LandSat data the accuracy increase with the best kernel was about 2 percent compared to 'uniform' and 'distance', which is not bad.

  2. As 'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear' and 'cosine' are implemented in KernelDensity, it would only be logical to implement them as weights in kNN.


To be more constructive: maybe an example which shows the accuracy increase with kernels would do?

Can you suggest something to write about the motivation for using kernel weights for kNN classification / regression?

Meanwhile I will think about that.

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a line comment Oct 21, 2014
Nikolay Mayorov added some commits Oct 21, 2014
@GaelVaroquaux GaelVaroquaux and 1 other commented on an outdated diff Oct 21, 2014
sklearn/neighbors/base.py
return dist
elif callable(weights):
return weights(dist)
else:
- raise ValueError("weights not recognized: should be 'uniform', "
- "'distance', or a callable function")
+ kernel = KERNEL_WEIGHTS[weights]
+ if dist.dtype == np.ndarray: # when dist is array of arrays
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a line comment Oct 21, 2014

I am really not comfortable with such code. Remind me: when do we need to support arrays of arrays?

@nmayorov
nmayorov added a line comment Oct 21, 2014

Unfortunately it occurs with RadiusNeighborsClassifier: the number of neighbors is not fixed, and numpy can't convert an array of differently sized arrays into a 2-D array. I don't know what we can do about that.
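
A tiny illustration of the structure radius_neighbors returns (a sketch):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.array([[0.0], [0.1], [0.2], [1.0]])
    nn = NearestNeighbors(radius=0.3).fit(X)
    dist, ind = nn.radius_neighbors(X)

    print(dist.dtype)                 # object: each row has its own length
    print([d.shape for d in dist])    # ragged, so no 2-D float array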

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a line comment Oct 22, 2014
@nmayorov
nmayorov added a line comment Oct 23, 2014

Well, the public radius_neighbors method returns exactly such a structure. (And what can be done about that?) I will add a comment.

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a line comment Oct 23, 2014
@nmayorov
nmayorov added a line comment Oct 23, 2014

(It was a rhetorical question.) On a positive note: I updated the docstring of this method. I hope it is now clearer what's going on.

@GaelVaroquaux GaelVaroquaux commented on the diff Oct 21, 2014
sklearn/neighbors/base.py
+ Array representing the distances from points to neighbors.
+
+ ind : array of shape [n_samples, self.n_neighbors]
+ Indices of the nearest neighbors.
+
+ weights : array of shape [n_samples, self.n_neighbors]
+ Weights assigned to neighbors.
+ """
+ if self.weights in KERNEL_WEIGHTS:
+ dist, ind = self.kneighbors(X, n_neighbors=self.n_neighbors + 1)
+ bandwidth = dist[:, -1].ravel()
+ dist, ind = dist[:, :-1], ind[:, :-1]
+ weights = _get_weights(dist, self.weights, bandwidth=bandwidth)
+ else:
+ dist, ind = self.kneighbors(X)
+ weights = _get_weights(dist, self.weights)
@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a line comment Oct 21, 2014

Remind me: is there a check in the code path that will raise an error if the user mistyped the name of a kernel? Looking at this code, it is not obvious to me.

@nmayorov
nmayorov added a line comment Oct 21, 2014

Now it goes like this: _get_weights -> _check_weights. There wasn't a check for weights in __init__, and I didn't add it.

I don't completely understand the policy on parameter checks in __init__. According to the guidelines nothing should be done / checked in __init__, because of cloning and set_params, but in some classes (like KNeighborsClassifier) some checks are performed regardless.

@GaelVaroquaux
scikit-learn member
GaelVaroquaux added a line comment Oct 22, 2014

OK, that's fair enough. Thank you.

@GaelVaroquaux
scikit-learn member

Thank you. This is really a great contribution. I have the feeling that we are almost ready for a merge. Just a few minor things to address.

@nmayorov

Thank you for the kind words! It really helps with motivation.

@agramfort agramfort commented on an outdated diff Oct 22, 2014
doc/modules/neighbors.rst
``weights = 'distance'`` assigns weights proportional to the inverse of the
-distance from the query point. Alternatively, a user-defined function of the
-distance can be supplied which is used to compute the weights.
-
+distance from the query point. Other allowed values for ``weights`` are
+:ref:`kernels <kernels>`: ``'tophat'``, ``'gaussian'``, ``'epanechnikov'``,
+``'exponential'``, ``'linear'``, ``'cosine'``. The bandwidth of a kernel is
+equal to the distance to the :math:`k + 1` neighbor for
+:class:`KNeighborsClassifier` and to the radius :math:`r` for
+:class:`RadiusNeighborsClassifier`. The sum of kernel-weighted votes for
+a class is proportional to the probability density for this class estimated
+with the kernel, and the class with the highest probability density is
+picked. Alternatively, a user-defined function of the distance can be supplied
+which is used to compute the weights.
@agramfort
scikit-learn member
agramfort added a line comment Oct 22, 2014

I would add a note about smoothness when using kernels here too.

thanks @nmayorov, you're almost there!

@nmayorov

Hey @dsullivan7, could you please clarify which of these variants are correct:
1.

    X : array-like
        Input data.
or
    X : array-like
        The input data.

2.

    y : array-like
        Predicted values.
or
    y : array-like
        The predicted values.

Because we use "the" with singulars (e.g. "The radius of the neighbors search"), it seems we should also use "the" with plurals, but I'm not sure (it sounds better without).

@dsullivan7

@nmayorov I agree with you. Exclude the "the" in this case.

@nmayorov

I added two short paragraphs about choosing the weights parameter. I'm not entirely happy with them, but it is something to start with. Please be so kind as to review them.

Also, the modification of plot_regression.py is somewhat controversial (I think).

I believe that some example comparing the effectiveness of different weights in classification should also be provided. (Maybe based on the existing plot_classification.py, in a similar fashion to plot_regression.py?)

Nikolay Mayorov added some commits Oct 23, 2014
@coveralls


Coverage increased (+0.05%) when pulling ade728f on nmayorov:neighbors_kernels into 3a02da8 on scikit-learn:master.

@nmayorov nmayorov changed the title from [WIP] Added kernel weighting functions for neighbors classes to [MRG] Added kernel weighting functions for neighbors classes Oct 29, 2014
@nmayorov

About the examples comparing the performance of weighting schemes: I decided not to add them for the following reasons:

  1. On toy / synthetic datasets it is misleading and deceptive: the results depend mostly on the train / test split and not on the actual weighting schemes.

  2. Doing it on a real dataset doesn't seem to fit into the docs. (There are no such boring comparisons for other methods.) Also, I'm struggling to find suitable datasets among those shipped with scikit-learn.

That's also the reason I removed the MSE estimates (previously added by me) from plot_regression.py (they are rather meaningless).


OK, would you guys do the final review, @agramfort @GaelVaroquaux, please?

Nikolay Mayorov Tiny doc fix 17391d3
@agramfort
scikit-learn member

I am a bit lost. At some point you posted results demonstrating some benefit from these new kernels. Are you saying that none of the datasets we commonly use back up this claim?

there are some scripts which go beyond simple examples in examples/applications/

@nmayorov

You are right, it sounds confusing. I'm a bit lost myself.

I demonstrated a 2% accuracy increase when using kernel weights on a dataset containing 4435 training samples and 2000 test samples. This result is statistically significant, I believe (and it is a real-life example). But when I experimented with the iris and digits datasets I found that there was no clear benefit of using weights other than uniform. Iris is too small and the accuracy mostly depends on the train / test split. In digits the best results are obtained with 1 nearest neighbor, so the weights are irrelevant.

Experiments with small synthetic datasets also show that the accuracy changes significantly with different train / test splits, and I don't want to mislead users by choosing a "proper" random seed. I may keep looking in this direction, though.

We need three properties for a dataset: it's a classification problem, it's big enough, and it's from real life. I don't think synthetic datasets are that interesting.

Maybe I can add the fetch function for https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite) ?

@agramfort
scikit-learn member
@nmayorov

Hi!

I experimented more with different weights. I have to admit that I overstated their influence. The accuracy boost was only 1% (not 2%), and again it depends on the train / test split. In general it gives an improvement within a 1% range, so it is somewhat useful, but not very.

I think it's not worth it to add 5 new weights, because they are rather similar and give only marginal improvements.

But I think some scheme called 'distance+' might be added. It could be a linear kernel, or the scheme described in http://www.joics.com/publishedpapers/2012_9_6_1429_1436.pdf (I suspect that their train / test splits weren't completely random, but no doubt the proposed scheme gives some improvement).

Please give me your opinion: should I continue this PR or create a new one with a single 'distance+' scheme? Or maybe neither.

@agramfort
scikit-learn member
@nmayorov

So what I've done:

  1. Removed all kernels but 'linear'. It's the simplest and most theoretically sound option for weighting.

  2. Added fetch_landsat.

  3. Added an example comparing accuracy on landsat for different n_neighbors.

  4. Shortened the additions in the rst doc.

@agramfort
scikit-learn member

I have no time to look. Somebody please review.

@nmayorov

@agramfort maybe you could do it later? @GaelVaroquaux, it would be great if you could join.

@nmayorov

Hi!

If you think this PR is not worth including in the project, you can close it. Otherwise, I'm ready to continue working on it. Either way is fine with me.

@amueller
scikit-learn member

Sorry for the lack of feedback. This could be interesting. Do you have any relevant paper references?

@nmayorov

The main reference would be "The Elements of Statistical Learning", chapter 6.

@amueller
scikit-learn member

The problem with using this as a reference is that it is hard to tell if people find it valuable in practice ;)

@nmayorov

The situation here is the same as with kernels for KDE. Look at the different kernel shapes: they all work very similarly, but it's impossible to choose one "best" kernel, so let's have some variety.

Initially I wanted to add all the kernels available in KernelDensity for consistency, because NN classification is kernel density estimation. But I noticed that they give only a marginal improvement over standard NN, so I decided to keep only the triangular ('linear') kernel as the most straightforward one. It can still give about +1% accuracy on some datasets; I added an example on the landsat dataset.
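
For reference, the triangular ('linear') kernel weighting is simply this (a sketch; dist holds the neighbor distances and bandwidth is the distance to the (k+1)-th neighbor):

    import numpy as np

    def linear_kernel_weights(dist, bandwidth):
        # weight decreases linearly to zero at the bandwidth,
        # and stays zero beyond it
        return np.clip(1.0 - dist / bandwidth, 0.0, None)

    dist = np.array([0.0, 0.2, 0.5, 0.9])
    print(linear_kernel_weights(dist, bandwidth=1.0))  # -> 1.0, 0.8, 0.5, 0.1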

@amueller
scikit-learn member

Thanks for your comments.
Sorry, we are a bit overwhelmed with PRs at the moment, and will focus on bugfixes for the upcoming release.
I'll try to look at this in more detail soon.

@nmayorov

Thanks for taking an interest!
