
[MRG] Added kernel weighting functions for neighbors classes #3117

Closed
wants to merge 27 commits

Conversation

nmayorov (Contributor)

This patch enables the use of kernel functions for neighbors weighting.

It adds the following keywords for the weights argument: 'tophat', 'gaussian', 'epanechnikov', 'exponential', 'linear' and 'cosine', i.e. all kernels available in the KernelDensity class.

For KNeighborsClassifier and KNeighborsRegressor the kernel bandwidth is equal to the distance to the (k+1)-th nearest neighbor (i.e. it depends on the query point).

For RadiusNeighborsClassifier and RadiusNeighborsRegressor the kernel bandwidth is equal to the radius parameter of the classifier (i.e. it is constant).
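A rough sketch of what such kernel weighting functions might look like (the formulas below are assumptions modeled on the kernel names in KernelDensity; the PR's actual implementation may differ):

```python
import numpy as np

def kernel_weights(dist, bandwidth, kernel="linear"):
    """Turn neighbor distances into weights via a truncated kernel.

    All kernels are cut off at ``bandwidth``: neighbors farther away
    get zero (or near-zero) weight.
    """
    u = np.clip(np.asarray(dist, dtype=float) / bandwidth, 0.0, 1.0)
    if kernel == "tophat":
        return np.ones_like(u)
    if kernel == "linear":
        return 1.0 - u
    if kernel == "epanechnikov":
        return 1.0 - u ** 2
    if kernel == "cosine":
        return np.cos(0.5 * np.pi * u)
    if kernel == "exponential":
        return np.exp(-3.0 * u)               # truncated exponential
    if kernel == "gaussian":
        return np.exp(-0.5 * (3.0 * u) ** 2)  # truncated at ~3 sigma
    raise ValueError("unknown kernel: %r" % kernel)

# Closer neighbors get larger weights; at the bandwidth the weight
# drops to (near) zero.
print(kernel_weights([0.0, 0.5, 1.0], bandwidth=1.0, kernel="linear"))
```

For KNeighbors* the bandwidth would be the per-query distance to the (k+1)-th neighbor; for RadiusNeighbors* it would simply be radius.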

Please take a look.

@coveralls

Coverage Status

Coverage remained the same when pulling 9d3f813 on nmayorov:neighbors_kernels into 6945d5b on scikit-learn:master.

@nmayorov nmayorov changed the title Added kernel weighting functions for neighbors classes [WIP] Added kernel weighting functions for neighbors classes Oct 7, 2014
@nmayorov (Contributor Author)

nmayorov commented Oct 7, 2014

Hello! This has been here for months with no attention, unfortunately.

Let me try to explain the intention of this PR in more detail.

Currently there are two options for weighting neighbor predictions: 'uniform' (majority vote) and 'distance', which uses 1/dist weights. The first is classic; the second is quite controversial (the infinities that can occur are not fun to deal with; I'm not sure it's a good option, to be honest).

There is also a probabilistic interpretation of neighbor methods, which manifests itself in sklearn.neighbors.KernelDensity. We can use it for prediction in kNN as well: estimate the PDF of each class at a query point and then pick the one with the highest probability (a Bayesian approach). This can be done very easily by using kernels (as in kernel density estimation) as weighting functions.

One subtle point is that some kernel functions (like the gaussian) are non-zero on an infinite interval, so in kNN prediction we have to use their "truncated" versions. But I don't think it matters much in practice. As far as the selected kernel bandwidth is concerned, please refer to my opening message.
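This kernel-weighted vote can already be emulated with the existing estimator API, since weights accepts a callable operating on the array of neighbor distances. A minimal sketch (the epanechnikov helper and the per-query bandwidth choice are our assumptions, approximating the PR's k+1 rule):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def epanechnikov(dist):
    # dist has shape (n_queries, n_neighbors); use the farthest returned
    # neighbor as a per-query bandwidth, a stand-in for the distance to
    # the k+1 nearest neighbor described above.
    h = dist.max(axis=1, keepdims=True) + 1e-12
    u = np.clip(dist / h, 0.0, 1.0)
    return 1.0 - u ** 2

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=6, weights=epanechnikov)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```

The tiny epsilon keeps the farthest neighbor's weight strictly positive, so no query row ends up with all-zero weights.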

Other neighbor weighting strategies exist which aren't directly associated with kernel density estimation. Potentially we could incorporate them into sklearn.neighbors as well. Overall I think there should be more options besides 'uniform' and 'distance'.


Please tell me whether you think this is useful. I'm willing to finish this PR properly (add narrative docs and so on).

Ping @agramfort @jakevdp @larsman. Anyone?

@agramfort (Member)

Can you provide some benchmark results that demonstrate the usefulness of this on a public dataset, in terms of accuracy and computation time?

Thanks

@nmayorov (Contributor Author)

Hi, Alexandre.

I created an IPython notebook where I test different weights on the famous data set. Take a look: http://nbviewer.ipython.org/gist/nmayorov/9b11161f9b66df12d2b9.

@agramfort (Member)

OK, good. Can you comment on the extra computation time, if any is significant?
How long is test time?
You'll need to add a paragraph to the narrative docs explaining the kernels
and why one might want to use them.

@nmayorov (Contributor Author)

It does not require any significant extra time; it's simply a matter of evaluating a different weighting function (just like 1/dist). I added benchmarks to the IPython notebook.

Also, I remembered one thing: this technique for regression is known as the Nadaraya-Watson estimator. In fact, there is a whole chapter on similar methods in "The Elements of Statistical Learning" (for example, check out Figure 6.1 there; it's quite illustrative).

With a proper kernel (non-zero only within the bandwidth) we can do this regression locally, using only a small number of neighbors. Perhaps we should keep only kernels which are non-zero locally within the bandwidth range, for theoretical integrity. What do you think?
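The Nadaraya-Watson estimate is the kernel-weighted average ŷ(x) = Σᵢ K(|x − xᵢ| / h) yᵢ / Σᵢ K(|x − xᵢ| / h). A minimal 1-D sketch on toy data (not the PR's code):

```python
import numpy as np

def nadaraya_watson(x_query, X, y, h):
    # Epanechnikov kernel: non-zero only within the bandwidth h,
    # so only nearby points contribute to the estimate.
    u = np.abs(x_query - X) / h
    w = np.maximum(1.0 - u ** 2, 0.0)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 10.0, 200))
y = np.sin(X) + rng.normal(scale=0.1, size=X.size)
print(nadaraya_watson(5.0, X, y, h=0.5))  # close to sin(5) ≈ -0.96
```

Because the kernel vanishes outside the bandwidth, only points within h of the query enter the weighted mean, which is exactly the local behavior discussed above.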

About the narrative docs: I think I'll just mention that this can be interpreted as KDE for classification and Nadaraya-Watson estimation for regression, but won't go deep into that (I don't think I can, either). After all, these are just a few new reasonable weighting functions which give more credit to closer neighbors.

Please give me some feedback.

@arjoly (Member)

arjoly commented Oct 14, 2014

Ping @jakevdp — you might want to have a look at this PR.

```diff
 """Get the weights from an array of distances and a parameter ``weights``

 Parameters
 ===========
 dist: ndarray
     The input distances
-weights: {'uniform', 'distance' or a callable}
+weights: None, string from VALID_WEIGHTS or callable
```
Member (inline review comment):

I would write

```python
weights : None, str or callable
    The kind of weighting used. The valid string parameters are
    'uniform', 'distance', 'tophat', etc.
```

The mathematical formulas of the different kernels should ideally be in the narrative doc, with a plot of the kernel shapes.

Contributor Author (inline review comment):

It is a private function of the module, so I thought I could be more technically explicit and mention VALID_WEIGHTS. (Makes sense?)

Member (inline review comment):

It was explicit and clear before; please keep it clear and explicit.

Contributor Author (inline review comment):

Got you.

@agramfort (Member)

You need to add a paragraph to the narrative doc and update an example to showcase this feature.

```python
        return dist
    elif callable(weights):
        return weights(dist)
    else:
        raise ValueError("weights not recognized: should be 'uniform', "
                         "'distance', or a callable function")
```
Member (inline review comment):

There was a nice error message and now it's gone... please step into the shoes of a user who messes up the name of the kernel.

Contributor Author (inline review comment):

It is checked in _check_weights; previously there was duplication. The error message will appear all right.

Member (inline review comment):

Good job redesigning this!

@arjoly arjoly mentioned this pull request Oct 17, 2014
@nmayorov (Contributor Author)

I've done some work; please review.

```diff
@@ -34,16 +34,16 @@
 # Fit regression model
 n_neighbors = 5

-for i, weights in enumerate(['uniform', 'distance']):
+plt.figure(figsize=(8, 9))
+for i, weights in enumerate(['uniform', 'distance', 'epanechnikov']):
```
Member (inline review comment):

Running this example, it seems that epanechnikov is the one kernel in this list that does not force the line to go through the training points.

In terms of user understanding, this point should be explained in the doc.

[figure_1: regression example plot comparing the three weighting schemes]

Contributor Author (inline review comment):

I wouldn't say that this is a characteristic property of smoothing kernels. The estimate with smoothing kernels is simply less bumpy and smoother than with uniform weights, and that's all. (Why did you imply that 'uniform' forces the line to go through training points?)

Only 'distance' shows this weird property that the line has to pass through every training point. (Which, again, is an argument not to use it at all.)

@nmayorov (Contributor Author)

About examples comparing the performance of weighting schemes: I decided not to add them for the following reasons:

  1. On toy / synthetic data sets it is misleading and deceptive. Results depend mostly on the train / test split and not on the actual weighting scheme.
  2. Doing it on a real data set doesn't seem to fit into the docs. (There are no such comparisons for other methods.) Also, I'm struggling to find suitable data sets among those included in scikit-learn.

That's also the reason I removed the MSE estimates (previously added by me) from plot_regression.py. (They are rather meaningless.)


OK, could you guys do the final review, @agramfort @GaelVaroquaux, please?

@agramfort (Member)

I am a bit lost. At some point you posted results demonstrating some benefit of these new kernels. Are you saying that none of the datasets we commonly use back up this claim?

There are some scripts which go beyond simple examples in examples/applications/.

@nmayorov (Contributor Author)

You are right, it sounds confusing. I'm a bit lost myself.

I demonstrated a 2% accuracy increase when using kernel weights on a data set containing 4435 train samples and 2000 test samples. This result is statistically significant, I believe (and it is a real-life example). But when I experimented with the iris and digits data sets I found no clear benefit from using weights other than uniform. Iris is too small, and the accuracy mostly depends on the train / test split. On digits the best results are obtained with 1 nearest neighbor, so the weights are irrelevant.

Experiments with small synthetic data sets also show that the accuracy changes significantly with different train / test splits, and I don't want to delude users by choosing a "proper" random seed. I may keep looking in this direction, though.

We need three properties from a data set: it's a classification problem, it's big enough, and it's from real life. I don't think synthetic data sets are that interesting.

Maybe I could add a fetch function for https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)?

@agramfort (Member)

+1 for fetching a more convincing dataset.

@nmayorov (Contributor Author)

nmayorov commented Nov 2, 2014

Hi!

I experimented more with the different weights. I have to admit that I overstated their influence. The accuracy boost was only 1% (not 2%), and again it depends on the train / test split. In general it gives an improvement within a 1% range. So it is somewhat useful, but not very.

I don't think it's worth adding 5 new weights, because they are rather similar and give only marginal improvements.

But I think a scheme called 'distance+' might be added. It could be a linear kernel, or the scheme described in http://www.joics.com/publishedpapers/2012_9_6_1429_1436.pdf. I suspect their train / test splits weren't completely random, but no doubt the proposed scheme gives some improvement.

Please give me your opinion: should I continue this PR, create a new one with a single 'distance+' scheme, or neither?

@agramfort (Member)

> I experimented more with different weights. I have to admit that I overstated their influence. The accuracy boost was only 1% (not 2), and again it depends on the train / test split. In general it gives an improvement within a 1% range (on landsat). So it is somewhat useful, but not very.

ok.

> I think it's not worth it to add 5 new weights, because they are rather similar and give only marginal improvements.

indeed.

> But I think some scheme called 'distance+' might be added. It can be a linear kernel, or a scheme described in http://www.joics.com/publishedpapers/2012_9_6_1429_1436.pdf I suspect that their train / test splits weren't completely random. But no doubt the proposed scheme gives some improvement.

you'll need to quantify the improvement. Also remember that what you
add should be textbook material or from a highly cited paper.

> Please give me your opinion: should I continue this PR or create a new one with a single 'distance+' scheme? Or maybe neither of that.

same PR is good.

@nmayorov (Contributor Author)

nmayorov commented Nov 3, 2014

So, what I've done:

  1. Removed all kernels but 'linear'. It's the simplest and a theoretically sound option for weighting.
  2. Added fetch_landsat.
  3. Added an example comparing accuracy on landsat for different n_neighbors.
  4. Shortened the additions in the rst doc.

@agramfort (Member)

I have no time to look. Somebody please review.

@nmayorov (Contributor Author)

nmayorov commented Nov 4, 2014

@agramfort, maybe you could do it later? @GaelVaroquaux, it would be great if you joined.

@nmayorov (Contributor Author)

Hi!

If you think this PR is not worth including in the project, you can close it. Otherwise, I'm ready to continue working on it. Either way is fine with me.

@amueller (Member)

Sorry for the lack of feedback. This could be interesting. Do you have any relevant paper references?

@nmayorov (Contributor Author)

The main reference would be "The Elements of Statistical Learning", chapter 6.

@amueller (Member)

The problem with using this as a reference is that it is hard to tell if people find it valuable in practice ;)

@nmayorov (Contributor Author)

The situation here is the same as with kernels for KDE. Look at the different kernel shapes: they all work very similarly, but it's impossible to choose one "best" kernel, so let's have some variety.

Initially I wanted to add all the kernels present in KDE for consistency, because NN classification is kernel density estimation. But I noticed that they give only a marginal improvement over standard NN, so I decided to keep only the triangular kernel as the most "straightforward" one. It can still give about +1% accuracy on some datasets; I added an example on the landsat dataset.
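For reference, the surviving triangular ('linear') weighting can be sketched as follows (taking the farthest returned neighbor as the per-query bandwidth is our assumption, mirroring the earlier discussion):

```python
import numpy as np

def linear_kernel_weights(dist):
    # Weight falls linearly from 1 at distance 0 to 0 at the bandwidth,
    # here taken per query row as the largest neighbor distance returned.
    h = dist.max(axis=1, keepdims=True)
    return 1.0 - dist / np.where(h > 0.0, h, 1.0)

dist = np.array([[0.0, 1.0, 2.0, 4.0]])
print(linear_kernel_weights(dist))  # weights 1.0, 0.75, 0.5, 0.0
```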

@amueller (Member)

Thanks for your comments.
Sorry, we are a bit overwhelmed with PRs at the moment, and will focus on bugfixes for the upcoming release.
I'll try to look at this in more detail soon.

@nmayorov (Contributor Author)

Thanks for taking interest!

@cmarmo cmarmo added Needs Decision Requires decision and removed Waiting for Reviewer labels Sep 29, 2020
Base automatically changed from master to main January 22, 2021 10:48
@haiatn (Contributor)

haiatn commented Jul 29, 2023

Is this PR just waiting for review, or are we still deciding whether it is needed?

@adrinjalali (Member)

Given the lack of requests for this in recent years, I think we can close this. Happy to include it if it somehow comes up fresh. Thanks for the work you put into this, @nmayorov.

10 participants