[MRG] Added kernel weighting functions for neighbors classes #3117
Conversation
Hello! This has been here for months with no attention, unfortunately. Let me try to explain the intention of this PR in more detail.

Currently there are two options for weighting neighbor predictions: 'uniform' and 'distance'. There is also a probabilistic interpretation of neighbor methods, which manifests itself in kernel density estimation. One subtle point is that some kernel functions (like 'tophat' or 'epanechnikov') are non-zero only within the bandwidth. Other neighbor weighting strategies also exist which aren't directly associated with kernel density estimation; potentially we can also incorporate them into the `weights` parameter.

Please tell me whether you think it is useful or not. I'm willing to properly finish this PR (add narrative docs and so on).

Ping @agramfort @jakevdp @larsman Anyone?
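A minimal sketch of the idea through scikit-learn's existing API, since `weights` already accepts a callable mapping a distance array to a weight array (the Gaussian kernel, the fixed bandwidth, and the dataset here are illustrative assumptions, not this PR's adaptive-bandwidth implementation):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

def gaussian_weights(dist, bandwidth=0.5):
    # Gaussian kernel; `bandwidth` is a hand-picked constant here,
    # whereas the PR derives it from the (k+1)th neighbor distance.
    return np.exp(-0.5 * (dist / bandwidth) ** 2)

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=10, weights=gaussian_weights).fit(X, y)
print(clf.predict(X[:5]))
```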
Can you provide some benchmark results that demonstrate the usefulness of this on a public dataset, in terms of accuracy and computation time? Thanks.
Hi, Alexandre. I created an IPython notebook where I test different weights on the famous data set. Take a look: http://nbviewer.ipython.org/gist/nmayorov/9b11161f9b66df12d2b9
OK, good. Can you comment on the extra computation time, if it is significant?
It does not require any significant extra time; it's simply a matter of evaluating a different weighting function (just as 1 / dist is). I added benchmarks to the IPython notebook.

Also, I remembered one thing: this technique for regression is known as the Nadaraya-Watson estimator. In fact, there is a whole chapter about similar methods in "The Elements of Statistical Learning" (for example, check out Figure 6.1 there, which is pretty illustrative). With a proper kernel (non-zero only within the bandwidth) we can do this regression locally, using only a small number of neighbors. Perhaps we should keep only kernels with compact support (zero outside the bandwidth range), to have theoretical integrity. What do you think?

About the narrative docs: I think I'll just mention that this can be interpreted as KDE for classification and Nadaraya-Watson estimation for regression, but I won't go deep into that (I also don't think I can). After all, this is just a few new reasonable weighting functions which give more credit to closer neighbors.

Give me some feedback.
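For reference, the Nadaraya-Watson estimator mentioned here is $\hat{y}(x) = \sum_i K_h(x - x_i)\, y_i \,/\, \sum_j K_h(x - x_j)$. A minimal NumPy sketch for 1-D inputs (the Gaussian kernel, bandwidth, and toy data are illustrative assumptions):

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=0.5):
    # Kernel-weighted average of the training targets. With a
    # compact-support kernel the sum effectively runs over nearby
    # neighbors only, which is the local regression noted above.
    w = np.exp(-0.5 * ((x_query - x_train) / bandwidth) ** 2)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 50))
y = np.sin(x) + 0.1 * rng.standard_normal(50)
print(nadaraya_watson(2.0, x, y))
```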
ping @jakevdp, you might want to have a look at this PR.
sklearn/neighbors/base.py (outdated diff)

```diff
     """Get the weights from an array of distances and a parameter ``weights``

     Parameters
     ===========
     dist: ndarray
         The input distances
-    weights: {'uniform', 'distance' or a callable}
+    weights: None, string from VALID_WEIGHTS or callable
```
I would write:

    weights : None, str or callable
        The kind of weighting used. The valid string parameters are
        'uniform', 'distance', 'tophat', etc.

The mathematical formulas of the different kernels should be in the narrative doc, ideally with a plot of the kernel shapes.
It is a private function of the module, so I thought I could be more technically explicit and mention VALID_WEIGHTS. (Makes sense?)
It was explicit and clear before; please keep it clear and explicit.
Got you.
You need to add a paragraph to the narrative doc and update an example to showcase this feature.
```python
        return dist
    elif callable(weights):
        return weights(dist)
    else:
        raise ValueError("weights not recognized: should be 'uniform', "
                         "'distance', or a callable function")
```
There was a nice error message; now it's gone... Please step into the shoes of a user who messes up the name of the kernel.
It is checked in _check_weights; previously there was duplication. The error message will appear all right.
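A sketch of the validation pattern being described, assuming a module-level `VALID_WEIGHTS` constant as discussed above (the names come from this PR's branch; the body is an illustration, not the shipped scikit-learn code):

```python
VALID_WEIGHTS = ('uniform', 'distance', 'tophat', 'gaussian',
                 'epanechnikov', 'exponential', 'linear', 'cosine')

def _check_weights(weights):
    # Validate once at estimator construction time, so _get_weights
    # no longer needs to duplicate the error message.
    if weights is None or weights in VALID_WEIGHTS or callable(weights):
        return weights
    raise ValueError("weights not recognized: should be one of %s "
                     "or a callable function" % (VALID_WEIGHTS,))
```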
Good job redesigning this!
I've done some work; please review.
```diff
@@ -34,16 +34,16 @@
 # Fit regression model
 n_neighbors = 5

 plt.figure(figsize=(8, 9))
-for i, weights in enumerate(['uniform', 'distance']):
+for i, weights in enumerate(['uniform', 'distance', 'epanechnikov']):
```
I wouldn't say that it is a characteristic property of smoothing kernels. The estimate with smoothing kernels is less bumpy and smoother than with uniform weights, and that's all. (Why did you imply that 'uniform' forces the line to go through training points?)

Only 'distance' shows this weird property that the line has to pass through every training point. (Which is again an argument not to use it at all.)

About examples comparing the performance of weighting schemes: I decided not to add them because of the following reasons:

Also, that's the reason I removed the MSE estimations (previously added by me) from plot_regression.py. (They are rather meaningless.)

OK, would you guys do the final review, @agramfort @GaelVaroquaux please.
I am a bit lost. You posted at some point results demonstrating some benefit of these new kernels. Are you saying that none of the datasets we commonly use back up this claim? There are some scripts in examples/applications/ which go beyond simple examples.
You are right, it sounds confusing. I'm a bit lost myself. I demonstrated a 2% accuracy increase when using kernel weights on a data set containing 4435 train samples and 2000 test samples. This result is statistically significant, I believe (and it is a real-life example). But when I experimented with the iris and digits data sets, I found no clear benefit from using weights other than uniform. Iris is too small, and the accuracy mostly depends on the train/test split. In digits the best results are obtained with 1 nearest neighbor, so the weights are irrelevant. Experiments with small synthetic data sets also show that the accuracy changes significantly with different train/test splits, and I don't want to delude a user by choosing a "proper" random seed.

I may continue looking in this direction though. We need three properties for a data set: it's a classification problem, it's big enough, and it's from real life. I don't think synthetic data sets are that interesting. Maybe I can add a fetch function for https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite) ?
+1 for fetching a more convincing dataset.
Hi! I experimented more with different weights. I have to admit that I overstated their influence. The accuracy boost was only 1% (not 2%), and again it depends on the train/test split. In general it gives an improvement within a 1% range. So it is somewhat useful, but not very. I don't think it's worth adding 5 new weights, because they are rather similar and give only marginal improvements. But some scheme called 'distance+' might be added. It could be a linear kernel, or the scheme described in http://www.joics.com/publishedpapers/2012_9_6_1429_1436.pdf (I suspect their train/test splits weren't completely random, but no doubt the proposed scheme gives some improvement).

Please give me your opinion: should I continue this PR or create a new one with a single 'distance+' scheme? Or maybe neither.
ok.
indeed.
you'll need to quantify the improvement. Also remember that what you
same PR is good.
So what I've done:
I have no time to look. Somebody please review.
@agramfort maybe you could do it later? @GaelVaroquaux, it would be great if you joined.
Hi! If you think this PR is not worth including in the project, you can close it. Otherwise, I'm ready to continue working on it. Either way is fine with me.
Sorry for the lack of feedback. This could be interesting. Do you have any relevant paper references?
The main reference would be "The Elements of Statistical Learning", chapter 6.
The problem with using this as a reference is that it is hard to tell if people find it valuable in practice ;)
The situation here is the same as with kernels for KDE. Look at the different kernel shapes: they all work very similarly, but it's impossible to choose one "best" kernel, so let's have some variety. Initially I wanted to add all the kernels available in KDE for consistency, because NN classification is kernel density estimation. But I noticed that they give only marginal improvements over standard NN, so I decided to keep only the triangular kernel as the most "straightforward" one. It can surely give about +1% accuracy on some datasets; I added an example on the Landsat dataset.
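A sketch of the triangular (linear) kernel as a `weights` callable that works with the current scikit-learn API. Using each row's largest returned distance as the bandwidth is an approximation: the PR itself uses the distance to the (k+1)th neighbor, which a plain callable cannot see.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def triangular_weights(dist):
    # Linear (triangular) kernel: w = 1 - d / h. The per-query
    # bandwidth h is approximated by the largest of the k distances.
    h = dist.max(axis=1, keepdims=True)
    h[h == 0] = 1.0                              # degenerate all-zero rows
    return np.clip(1.0 - dist / h, 1e-9, None)   # floor avoids zero-sum rows

clf = KNeighborsClassifier(n_neighbors=10, weights=triangular_weights)
```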
Thanks for your comments.
Thanks for taking interest!
Is this PR just waiting for review? Or are we still doubting whether this is needed?
Given the lack of requests for this in recent years, I think we can close this. Happy to include it if somehow it comes up fresh. Thanks for the work you put into this, @nmayorov.
This patch enables the use of kernel functions for neighbors weighting. It adds the following keywords for the `weights` argument: `tophat`, `gaussian`, `epanechnikov`, `exponential`, `linear`, `cosine`, i.e. all kernels presented in the `KernelDensity` class.

For `KNeighborsClassifier` and `KNeighborsRegressor` the kernel bandwidth is equal to the distance to the (k+1)th nearest neighbor (i.e. it depends on the query point).

For `RadiusNeighborsClassifier` and `RadiusNeighborsRegressor` the kernel bandwidth is equal to the radius parameter of the classifier (i.e. it is constant).

Please take a look.
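A NumPy/scikit-learn sketch of the bandwidth rule described above for the k-NN case. This PR never merged, so the kernel evaluation is an illustration of the stated semantics with made-up data: query k+1 neighbors and use the extra distance as the per-query bandwidth.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 2))
X_query = rng.standard_normal((10, 2))

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
dist, ind = nn.kneighbors(X_query)
h = dist[:, -1:]            # (k+1)th distance = per-query bandwidth h(x)

# Epanechnikov kernel weights for the k nearest neighbors:
# w = 1 - (d / h)^2 for d < h, and 0 otherwise.
u = dist[:, :k] / h
weights = np.clip(1.0 - u ** 2, 0.0, None)
print(weights[0])
```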