
[MRG] Added kernel weighting functions for neighbors classes #3117

Open
wants to merge 27 commits into scikit-learn:master from nmayorov:neighbors_kernels

7 participants

@nmayorov

This patch enables the use of kernel functions for neighbors weighting.

It adds the following keywords for the weights argument: tophat, gaussian, epanechnikov, exponential, linear, cosine, i.e. all kernels available in the KernelDensity class.

For KNeighborsClassifier and KNeighborsRegressor the kernel bandwidth is equal to the distance to the (k+1)-th nearest neighbor (i.e. it depends on the query point).

For RadiusNeighborsClassifier and RadiusNeighborsRegressor the kernel bandwidth is equal to the radius parameter of the estimator (i.e. it is constant).
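
For illustration, here is a minimal usage sketch of the proposed interface (it only works with this branch, not with released scikit-learn; the toy data is just for the example):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(100, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # With this patch, any of the kernel names above can be passed as `weights`.
    clf = KNeighborsClassifier(n_neighbors=5, weights='epanechnikov')
    clf.fit(X, y)
    print(clf.predict(X[:3]))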

Please take a look.

@coveralls

Coverage Status

Coverage remained the same when pulling 9d3f813 on nmayorov:neighbors_kernels into 6945d5b on scikit-learn:master.

@nmayorov nmayorov changed the title from Added kernel weighting functions for neighbors classes to [WIP] Added kernel weighting functions for neighbors classes
@nmayorov

Hello! This has been sitting here for months with no attention, unfortunately.

Let me try to explain the intention of this PR in more detail.

Currently there are two options for weighting neighbor predictions: 'uniform' (majority vote) and 'distance', which uses 1/distance weights. The first one is classic; the second one is quite controversial (the infinities that can occur are not fun to deal with; I'm not sure it's a good option, to be honest).

There is also a probabilistic interpretation of neighbor methods, which manifests itself in sklearn.neighbors.KernelDensity. We can use it for prediction in kNN as well: estimate the PDF of each class at a query point and then pick the class with the highest probability (a Bayesian approach). This is very easily done by using kernels (as in kernel density estimation) as weighting functions.

One subtle point is that some kernel functions (like gaussian) are non-zero on an infinite interval, so in kNN prediction we have to use their "truncated" versions. But I don't think it matters much in practice. As far as the choice of kernel bandwidth is concerned, please refer to my opening message.
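
To make the idea concrete, here is a rough standalone sketch (my own code, not the implementation in this patch) of kernel-weighted kNN classification with a truncated linear kernel whose bandwidth is the distance to the (k+1)-th neighbor:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def kernel_knn_predict(X_train, y_train, X_query, k=5):
        y_train = np.asarray(y_train)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
        dist, ind = nn.kneighbors(X_query)
        bandwidth = dist[:, -1:]              # distance to the (k+1)-th neighbor
        dist, ind = dist[:, :-1], ind[:, :-1]
        weights = 1.0 - dist / bandwidth      # truncated linear kernel
        classes = np.unique(y_train)
        # Kernel-weighted votes are proportional to the estimated class densities.
        votes = np.array([np.sum(weights * (y_train[ind] == c), axis=1)
                          for c in classes])
        return classes[np.argmax(votes, axis=0)]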

Other neighbor weighting strategies also exist which aren't directly associated with kernel density estimation. Potentially we could incorporate them into sklearn.neighbors as well. Overall I think there should be more options besides 'uniform' and 'distance'.


Please tell me whether you think it is useful or not. I'm willing to finish this PR properly (add narrative docs and so on).

Ping @agramfort @jakevdp @larsman Anyone?

@agramfort
Owner

can you provide some benchmark results that demonstrate the usefulness of this on a public dataset, in terms of accuracy and computation time?

thanks

@nmayorov

Hi, Alexandre.

I created an IPython notebook where I test the different weights on a well-known dataset. Take a look: http://nbviewer.ipython.org/gist/nmayorov/9b11161f9b66df12d2b9.

@agramfort
Owner

ok good. Can you comment on the extra computation time, if any is significant?
How long is the test time?
You'll need to add a paragraph to the narrative docs explaining the kernels and why one might want to use them.

@nmayorov

It does not require any significant extra time; it's simply a matter of evaluating a different weighting function (just like 1 / dist). I added benchmarks to the IPython notebook.

Also, I remembered one thing: this technique for regression is known as the Nadaraya-Watson estimator. In fact there is a whole chapter about similar methods in "The Elements of Statistical Learning" (for example, check out Figure 6.1 there, which is pretty illustrative).
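
For reference, a minimal sketch (mine, with an arbitrary Gaussian kernel and a fixed bandwidth) of the Nadaraya-Watson estimator restricted to the k nearest neighbors:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def nadaraya_watson_knn(X_train, y_train, X_query, k=5, bandwidth=1.0):
        y_train = np.asarray(y_train, dtype=float)
        nn = NearestNeighbors(n_neighbors=k).fit(X_train)
        dist, ind = nn.kneighbors(X_query)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)    # Gaussian kernel weights
        # Locally weighted average of the neighbors' targets.
        return np.sum(w * y_train[ind], axis=1) / np.sum(w, axis=1)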

With a proper kernel (non-zero only within the bandwidth) we can do this regression locally using only a small number of neighbors. Perhaps we should keep only kernels which are non-zero locally within the bandwidth range, for theoretical integrity. What do you think?

About the narrative docs: I think I'll just mention that it can be interpreted as KDE for classification and Nadaraya-Watson estimation for regression, but I won't go deep into that (also, I don't think I can). After all, this is just a few new reasonable weighting functions which give more credit to closer neighbors.

Give me some feedback.

@arjoly
Owner

ping @jakevdp, you might want to have a look at this PR.

sklearn/neighbors/base.py
((19 lines not shown))
"""Get the weights from an array of distances and a parameter ``weights``
Parameters
===========
dist: ndarray
The input distances
- weights: {'uniform', 'distance' or a callable}
+ weights: None, string from VALID_WEIGHTS or callable
@agramfort Owner

I would write

weights: None, str or callable
    The kind of weighting used. The valid string parameters are
    'uniform', 'distance', 'tophat', etc....

the mathematical formula of the different kernels should be in the narrative
doc ideally with a plot of the kernel shapes.

It is a private function of the module, so I thought I could be more technically explicit and mention VALID_WEIGHTS. (Does that make sense?)

@agramfort Owner

it was explicit and clear before; please keep it clear and explicit.

Got you.

sklearn/neighbors/classification.py
@@ -187,7 +198,16 @@ def predict_proba(self, X):
"""
X = atleast2d_or_csr(X)
- neigh_dist, neigh_ind = self.kneighbors(X)
+ if self.weights in KERNEL_WEIGHTS:
+ neigh_dist, neigh_ind = \
+ self.kneighbors(X, n_neighbors=self.n_neighbors + 1)
+ bandwidth = neigh_dist[:, -1]
+ neigh_dist, neigh_ind = neigh_dist[:, :-1], neigh_ind[:, :-1]
+ weights = _get_weights(neigh_dist, self.weights,
+ bandwidth=bandwidth)
@agramfort Owner

this block of lines seems to be duplicated a few times. Why not update _get_weights to avoid these duplications?

Duplications are not good, but I don't see a better way here.

For bandwidth in _get_weights we can pass an array (for the KNeighbors classes) or a single value (for the RadiusNeighbors classes); the usage is case dependent. There is also a bit of logic for discarding the last column of neigh_dist and neigh_ind.
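
To illustrate the two cases, here is a hypothetical helper (the name and exact signature are mine, not the code in this PR) that accepts either a scalar bandwidth (the radius-based case) or one bandwidth per query point (the k-neighbors case):

    import numpy as np

    def _linear_weights(dist, bandwidth):
        """Sketch of 1 - d / h weights with a scalar or per-query bandwidth."""
        dist = np.asarray(dist, dtype=float)
        if np.isscalar(bandwidth):
            # RadiusNeighbors case: one fixed bandwidth (the radius) for all queries.
            scaled = dist / bandwidth
        else:
            # KNeighbors case: bandwidth[i] is the distance to the (k+1)-th
            # neighbor of query i, so divide each row by its own bandwidth.
            scaled = dist / np.asarray(bandwidth, dtype=float)[:, np.newaxis]
        return 1.0 - scaled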

I could modify KNeighborsMixin.kneighbors() so that it returns neigh_dist, neigh_ind and bandwidth (depending on self.weights), but that seems bad.

Refactoring this into a new helper function in base also seems excessive.

@agramfort Owner

you can change the API of _get_weights to avoid the duplication.

OK, I'll see to that, but still not sure.

@agramfort
Owner

you need to add a paragraph to the narrative doc and update an example to showcase this feature.

sklearn/neighbors/base.py
((50 lines not shown))
return dist
elif callable(weights):
return weights(dist)
else:
- raise ValueError("weights not recognized: should be 'uniform', "
- "'distance', or a callable function")
@agramfort Owner

there was a nice error message, now it's gone... please step into the shoes of a user who messes up the name of the kernel

It is checked in _check_weights; previously there was a duplication. The error message will still appear.

@GaelVaroquaux Owner

Good job redesigning this!

@arjoly arjoly referenced this pull request
Closed

[WIP] Kernel regression #3780

@nmayorov

I've done some work, please review.

examples/neighbors/plot_regression.py
@@ -34,16 +34,16 @@
# Fit regression model
n_neighbors = 5
-for i, weights in enumerate(['uniform', 'distance']):
+plt.figure(figsize=(8, 9))
+for i, weights in enumerate(['uniform', 'distance', 'epanechnikov']):
@agramfort Owner

running this example, it seems that epanechnikov is a kernel in this list that does not force the line to go through the training points.

in terms of user understanding, this point should be explained in the doc.

[figure_1: output of plot_regression.py for 'uniform', 'distance' and 'epanechnikov' weights]

I wouldn't say that it is a characteristic property of smoothing kernels. The estimate with smoothing kernels is less bumpy and smoother than with uniform weights, and that's all. (Why did you imply that 'uniform' forces the line to go through training points?)

And only 'distance' shows this weird property that the line has to pass through every training point. (Which is again an argument not to use it at all.)
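
A tiny numerical illustration (my own, not from the PR) of why 1/distance weights interpolate the training points: as the query approaches a training point, that point's normalized weight tends to 1, so the weighted average collapses to its target.

    import numpy as np

    dists = np.array([1e-6, 0.8, 1.3])   # query almost on top of the first neighbor
    w = 1.0 / dists
    print(w / w.sum())                    # ~[1., 0., 0.]: the nearest point dominates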

@agramfort
Owner

@dsullivan7 can you review this and help with the English wording? thanks a lot

@dsullivan7

Yes, will do in a bit

doc/modules/neighbors.rst
((6 lines not shown))
``weights = 'distance'`` assigns weights proportional to the inverse of the
-distance from the query point. Alternatively, a user-defined function of the
-distance can be supplied which is used to compute the weights.
-
+distance from the query point. Other allowed values for ``weights`` are names
+of :ref:`kernels <kernels>`: ``'tophat'``, ``'gaussian'``, ``'epanechnikov'``,

take out "names of"

doc/modules/neighbors.rst
((6 lines not shown))
``weights = 'distance'`` assigns weights proportional to the inverse of the
-distance from the query point. Alternatively, a user-defined function of the
-distance can be supplied which is used to compute the weights.
-
+distance from the query point. Other allowed values for ``weights`` are names
+of :ref:`kernels <kernels>`: ``'tophat'``, ``'gaussian'``, ``'epanechnikov'``,
+``'exponential'``, ``'linear'``, ``'cosine'``. The bandwidth of a kernel is
+equal to the distance to :math:`k + 1` neighbor for
+:class:`KNeighborsClassifier` and to the radius :math:`r` for
+:class:`RadiusNeighborsClassifier`. The sum of weighted by a kernel votes for

Is the bandwidth of a kernel equal to a single distance for KNeighborsClassifier? Like if I choose k=5 the bandwidth is equal to distance of the 6th closest? If that's the case it should be "equal to the distance to the :math:k + 1 neighbor"

Yes, the bandwidth is equal to a single distance. I will add "the" before :math:k + 1 neighbor.

doc/modules/neighbors.rst
((6 lines not shown))
``weights = 'distance'`` assigns weights proportional to the inverse of the
-distance from the query point. Alternatively, a user-defined function of the
-distance can be supplied which is used to compute the weights.
-
+distance from the query point. Other allowed values for ``weights`` are names
+of :ref:`kernels <kernels>`: ``'tophat'``, ``'gaussian'``, ``'epanechnikov'``,
+``'exponential'``, ``'linear'``, ``'cosine'``. The bandwidth of a kernel is
+equal to the distance to :math:`k + 1` neighbor for
+:class:`KNeighborsClassifier` and to the radius :math:`r` for
+:class:`RadiusNeighborsClassifier`. The sum of weighted by a kernel votes for
+a class is proportional to the probability density for this class estimated

I believe this should be "The sum of weighted kernel votes for a class" (remove "by a")

Is "kernel votes" some kind of special term here? Maybe you are right, but it doesn't sound clear to me. We have a set of votes for every class, then we multiply them by values of a kernel function at positions of voting neighbors (so "weight by a kernel".) Do you still think it should be changed?

Ah, I see: weighted-by-a-kernel votes. Hmmm. Perhaps "kernel-weighted votes"?

doc/modules/neighbors.rst
@@ -224,8 +235,15 @@ to the regression than faraway points. This can be accomplished through
the ``weights`` keyword. The default value, ``weights = 'uniform'``,
assigns equal weights to all points. ``weights = 'distance'`` assigns
weights proportional to the inverse of the distance from the query point.
-Alternatively, a user-defined function of the distance can be supplied,
-which will be used to compute the weights.
+Other allowed values for ``weights`` are names of :ref:`kernels <kernels>`:
+``'tophat'``, ``'gaussian'``, ``'epanechnikov'``, ``'exponential'``,
+``'linear'``, ``'cosine'``. The bandwidth of a kernel is equal to the distance
+to :math:`k + 1` neighbor for :class:`KNeighborsRegressor` and to the radius
+:math:`r` for :class:`RadiusNeighborsRegressor`. Using kernels for nearest
+neighbor regression results in smoother fitted function, which is often

"results is a smoother fitted function"

sklearn/neighbors/base.py
((19 lines not shown))
"""Get the weights from an array of distances and a parameter ``weights``
Parameters
===========
dist: ndarray
The input distances
- weights: {'uniform', 'distance' or a callable}
- The kind of weighting used
+ weights: None, callable or string
+ The kind of weighting used. The valid string parameters are 'uniform',
+ 'distance', 'tophat', 'gaussian', 'epanechnikov', 'exponential',
+ 'linear', 'cosine'.
+ bandwidth: float or ndarray
+ The kernel function bandwidth (only for kernel weighting).
+ If float, then the bandwidth is the same for all queries.
+ If ndarray, then i-th element is used as a bandwidth for
+ i-th query.
+

"then the i-th element is used as a bandwidth for the i-th query"

@nmayorov

@dsullivan7 thanks very much for your input.

sklearn/neighbors/tests/test_neighbors.py
@@ -1,3 +1,5 @@
+'gaussian', 'epanechnikov', 'exponential', 'linear', 'cosine'
@agramfort Owner

?

sklearn/neighbors/tests/test_neighbors.py
@@ -567,7 +580,8 @@ def test_RadiusNeighborsRegressor_multioutput(n_samples=40,
y = np.vstack([y, y]).T
y_target = y[:n_test_pts]
- weights = ['uniform', 'distance', _weight_func]
+ weights = [None, 'uniform', 'distance', _weight_func, 'gaussian',
+ 'epanechnikov', 'exponential', 'linear', 'cosine']
@agramfort Owner

just define a global WEIGHTS variable and avoid this duplication of

[None, 'uniform', 'distance', _weight_func, 'gaussian', 'epanechnikov', 'exponential', 'linear', 'cosine']

sklearn/neighbors/base.py
@@ -403,6 +429,36 @@ def kneighbors_graph(self, X, n_neighbors=None,
return csr_matrix((A_data.ravel(), A_ind.ravel(), A_indptr),
shape=(n_samples1, n_samples2))
+ def _get_neighbors_and_weights(self, X):
+ """Find neighbors to X and assign weights to them according to
+ class parameters.
+
+ Parameters
+ ----------
+ X : array of shape [n_samples, n_features]
+ A 2-D array representing the set of points.
+
+ Returns
+ -------
+ dist : array of shape [n_samples, self.n_neighbors]
@agramfort Owner

docstring formatting should be:

dist : array, shape (n_samples, self.n_neighbors)

please check all docstrings.

doc/modules/neighbors.rst
@@ -224,8 +235,15 @@ to the regression than faraway points. This can be accomplished through
the ``weights`` keyword. The default value, ``weights = 'uniform'``,
assigns equal weights to all points. ``weights = 'distance'`` assigns
weights proportional to the inverse of the distance from the query point.
-Alternatively, a user-defined function of the distance can be supplied,
-which will be used to compute the weights.
+Other allowed values for ``weights`` are names of :ref:`kernels <kernels>`:
+``'tophat'``, ``'gaussian'``, ``'epanechnikov'``, ``'exponential'``,
+``'linear'``, ``'cosine'``. The bandwidth of a kernel is equal to the distance
+to :math:`k + 1` neighbor for :class:`KNeighborsRegressor` and to the radius
+:math:`r` for :class:`RadiusNeighborsRegressor`. Using kernels for nearest
+neighbor regression results in smoother fitted function, which is often
+desirable. Alternatively, a user-defined function of the distance can be
+supplied, which will be used to compute the weights.
@agramfort Owner

you should explain in the doc why one might want to use a fancy kernel; the doc is meant to provide guidelines and not just be descriptive.

Well, I don't see any discussion of different kernels in the kernel density estimation section. And, as another example, the section about SVMs says nothing about how to choose a (feature mapping) kernel.

I mean that it is often hard to give general recommendations in machine learning. I would say that all kernels (except 'tophat') are very similar; on a particular dataset one or another might work better.

Why do we need different kernels in kNN then? I give two arguments:

  1. Kernels perform differently depending on the data, and we don't know a priori which one will be the best. On the Landsat data the accuracy increase with the best kernel was about 2 percent compared to 'uniform' and 'distance', which is not bad.

  2. As 'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear' and 'cosine' are already implemented in KernelDensity, it seems only logical to implement them as weights in kNN.


To be more constructive: maybe an example which shows the accuracy increase with kernels would do?

Can you suggest something to write about the motivation for using kernel weights for kNN classification / regression?

Meanwhile I will think about that.

sklearn/neighbors/base.py
((53 lines not shown))
return dist
elif callable(weights):
return weights(dist)
else:
- raise ValueError("weights not recognized: should be 'uniform', "
- "'distance', or a callable function")
+ kernel = KERNEL_WEIGHTS[weights]
+ if dist.dtype == np.ndarray: # when dist is array of arrays
@GaelVaroquaux Owner

I am really not comfortable with such code. Remind me, when do we need to support arrays of arrays?

Unfortunately it occurs with RadiusNeighborsClassifier: the number of neighbors is not fixed, and numpy can't convert an array of different-sized arrays into a 2-D array. I don't know what we can do about that.
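
A minimal illustration (toy data of my own) of that structure: when the number of neighbors per query differs, radius_neighbors returns object-dtype arrays whose entries are 1-D arrays of different lengths, so any per-neighbor weighting has to be applied row by row.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.array([[0.0], [0.1], [0.2], [5.0]])
    nn = NearestNeighbors(radius=0.5).fit(X)
    dist, ind = nn.radius_neighbors(np.array([[0.05], [5.0]]))

    print(dist.dtype)                  # object
    print([d.shape for d in dist])     # [(3,), (1,)] -- different neighbor counts
    # Per-neighbor weights therefore need a loop over the rows:
    weights = np.array([1.0 - d / 0.5 for d in dist], dtype=object)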

@GaelVaroquaux Owner

Well, the public radius_neighbors method returns exactly such a structure. (And what can be done about that?) I will add a comment.

@GaelVaroquaux Owner

(It was a rhetorical question.) On a positive note: I updated the docstring of this method. I hope it is now clearer what's going on.

@GaelVaroquaux GaelVaroquaux commented on the diff
sklearn/neighbors/base.py
((16 lines not shown))
+ Array representing the distances from points to neighbors.
+
+ ind : array of shape [n_samples, self.n_neighbors]
+ Indices of the nearest neighbors.
+
+ weights : array of shape [n_samples, self.n_neighbors]
+ Weights assigned to neighbors.
+ """
+ if self.weights in KERNEL_WEIGHTS:
+ dist, ind = self.kneighbors(X, n_neighbors=self.n_neighbors + 1)
+ bandwidth = dist[:, -1].ravel()
+ dist, ind = dist[:, :-1], ind[:, :-1]
+ weights = _get_weights(dist, self.weights, bandwidth=bandwidth)
+ else:
+ dist, ind = self.kneighbors(X)
+ weights = _get_weights(dist, self.weights)
@GaelVaroquaux Owner

Remind me: is there a check in this code path that will raise an error if the user mistyped the name of a kernel? Looking at this code path, it is not obvious to me.

Now it goes like this: _get_weights -> _check_weights. There wasn't a check for weights in __init__, and I didn't add it.

I don't completely understand the policy on parameter checks in __init__. According to the guidelines nothing should be done or checked in __init__, because of cloning and set_params. But in some classes (like KNeighborsClassifier) some checks are performed regardless.
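
For context, this is the convention being referred to, sketched with a toy estimator (a hypothetical class, not code from this PR): __init__ and set_params only store parameters, and validation is deferred to fit.

    from sklearn.base import BaseEstimator

    class ToyWeightedEstimator(BaseEstimator):
        def __init__(self, weights='uniform'):
            self.weights = weights            # stored as-is, no validation here

        def fit(self, X, y):
            # Validation happens here, so clone() / set_params() can freely
            # manipulate parameters without triggering checks.
            if self.weights not in ('uniform', 'distance', 'linear'):
                raise ValueError("weights not recognized: %r" % self.weights)
            return self

    # est = ToyWeightedEstimator().set_params(weights='oops')   # no error yet
    # est.fit(X, y)                                             # raises ValueError here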

@GaelVaroquaux Owner

OK, that's fair enough. Thank you.

@GaelVaroquaux

Thank you. This is really a great contribution. I have the feeling that we are almost ready for a merge. Just a few minor things to address.

@nmayorov

Thank you for the kind words! It really helps with motivation.

doc/modules/neighbors.rst
((6 lines not shown))
``weights = 'distance'`` assigns weights proportional to the inverse of the
-distance from the query point. Alternatively, a user-defined function of the
-distance can be supplied which is used to compute the weights.
-
+distance from the query point. Other allowed values for ``weights`` are
+:ref:`kernels <kernels>`: ``'tophat'``, ``'gaussian'``, ``'epanechnikov'``,
+``'exponential'``, ``'linear'``, ``'cosine'``. The bandwidth of a kernel is
+equal to the distance to the :math:`k + 1` neighbor for
+:class:`KNeighborsClassifier` and to the radius :math:`r` for
+:class:`RadiusNeighborsClassifier`. The sum of kernel-weighted votes for
+a class is proportional to the probability density for this class estimated
+with the kernel, and the class with the highest probability density is
+picked. Alternatively, a user-defined function of the distance can be supplied
+which is used to compute the weights.
@agramfort Owner

I would add a note about smoothness when using a kernel here too.

thanks @nmayorov, you're almost there!

@nmayorov

Hey @dsullivan7, could you please clarify which are the correct variants:
1.

    X : array-like
        Input data.
or
    X : array-like
        The input data.

2.

    y : array-like
        Predicted values.
or
    y : array-like
        The predicted values.

Because we use "the" with singulars (e.g. "The radius of neighbors search"), it seems that we should also use "the" with plurals, but I'm not sure (it sounds better without).

@dsullivan7

@nmayorov I agree with you. Exclude the "the" in this case.

@nmayorov

I added two short paragraphs about choosing the weights parameter. I'm not entirely happy with them, but it is something to start with. Please be so kind as to review them.

Also, the modification of plot_regression.py is somewhat controversial (I think).

I believe that an example comparing the effectiveness of different weights in classification should also be provided. (Maybe based on the existing plot_classification.py, in a similar fashion to plot_regression.py?)

@coveralls

Coverage Status

Coverage increased (+0.05%) when pulling ade728f on nmayorov:neighbors_kernels into 3a02da8 on scikit-learn:master.

@nmayorov nmayorov changed the title from [WIP] Added kernel weighting functions for neighbors classes to [MRG] Added kernel weighting functions for neighbors classes
@nmayorov

About examples comparing the performance of weighting schemes: I decided not to add them for the following reasons:

  1. On toy / synthetic datasets such a comparison is misleading and deceptive. Results depend mostly on the train / test split and not on the actual weighting schemes.

  2. Doing it on a real dataset doesn't seem to fit into the docs. (There are no such boring comparisons for other methods.) Also, I'm struggling to find suitable datasets among those shipped with scikit-learn.

That's also the reason I removed the MSE estimates (previously added by me) from plot_regression.py. (They are rather meaningless.)


OK, would you guys do the final review, @agramfort @GaelVaroquaux, please?

@agramfort
Owner

I am a bit lost. At some point you posted results demonstrating some benefit of these new kernels. Are you saying that none of the datasets we commonly use back up this claim?

there are some scripts which go beyond simple examples in examples/applications/

@nmayorov

You are right, it sounds confusing. I'm a bit lost myself.

I demonstrated a 2% accuracy increase when using kernel weights on a dataset containing 4435 train samples and 2000 test samples. This result is statistically significant, I believe (and it is a real-life example). But when I experimented with the iris and digits datasets I found that there was no clear benefit from using weights other than uniform. Iris is too small, and the accuracy mostly depends on the train / test split. In digits the best results are obtained with 1 nearest neighbor, so the weights are irrelevant.

Experiments with small synthetic datasets also show that the accuracy changes significantly with different train / test splits, and I don't want to delude users by choosing a "proper" random seed. I may keep looking in this direction though.
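
One way (my own sketch, using the same imports as the example in this branch) to quantify that split-to-split variability before trusting a small accuracy difference:

    import numpy as np
    from sklearn.cross_validation import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def score_over_splits(X, y, weights, n_splits=20, **knn_params):
        """Mean and std of test accuracy over repeated random train/test splits."""
        scores = []
        for seed in range(n_splits):
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
            clf = KNeighborsClassifier(weights=weights, **knn_params)
            scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))
        return np.mean(scores), np.std(scores)

    # e.g. compare score_over_splits(X, y, 'uniform') with
    # score_over_splits(X, y, 'linear') (the latter needs this branch).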

We need a dataset with three properties: it's a classification problem, it's big enough, and it's from real life. I don't think synthetic datasets are that interesting.

Maybe I can add the fetch function for https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite) ?

@agramfort
Owner
@nmayorov

Hi!

I experimented more with the different weights. I have to admit that I overstated their influence. The accuracy boost was only 1% (not 2), and again it depends on the train / test split. In general it gives an improvement within a 1% range. So it is somewhat useful, but not very.

I don't think it's worth adding 5 new weights, because they are rather similar and give only marginal improvements.

But I think some scheme called 'distance+' might be added. It could be a linear kernel, or the scheme described in http://www.joics.com/publishedpapers/2012_9_6_1429_1436.pdf. I suspect that their train / test splits weren't completely random, but no doubt the proposed scheme gives some improvement.

Please give me your opinion: should I continue this PR or create a new one with a single 'distance+' scheme? Or maybe neither.
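
A small numeric sketch (mine) of how such linear-kernel weights compare to plain 1/distance weights; in particular, they stay bounded when a neighbor coincides with the query point:

    import numpy as np

    dist = np.array([0.0, 0.6, 1.2])          # distances to 3 neighbors
    bandwidth = 1.5                            # e.g. distance to the 4th neighbor
    with np.errstate(divide='ignore'):
        print(1.0 / dist)                      # [inf, 1.67, 0.83] -- unbounded
    print(1.0 - dist / bandwidth)              # [1.0, 0.6, 0.2]   -- bounded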

@agramfort
Owner
@nmayorov

So here's what I've done:

  1. Removed all kernels but 'linear'. It's the simplest and most theoretically sound option for weighting.

  2. Added fetch_landsat.

  3. Added an example comparing accuracy on Landsat for different values of n_neighbors.

  4. Shortened the additions in the rst doc.

@agramfort
Owner

I have no time to look. Somebody please review.

@nmayorov

@agramfort maybe you could do it later? @GaelVaroquaux, it would be great if you joined.

@nmayorov

Hi!

If you think this PR is not worth including in the project, you can close it. Otherwise, I'm ready to continue working on it. I'm fine with either option.

@amueller
Owner

Sorry for the lack of feedback. This could be interesting. Do you have any relevant paper references?

@nmayorov

The main reference would be "The Elements of Statistical Learning", chapter 6.

@amueller
Owner

The problem with using this as a reference is that it is hard to tell if people find it valuable in practice ;)

@nmayorov

The situation here is the same as with kernels for KDE. Look at the different kernel shapes: they all work very similarly, but it's impossible to choose one "best" kernel, so let's have some variety.

Initially I wanted to add all the kernels present in KDE for consistency, because NN classification is kernel density estimation. But I noticed that they give only a marginal improvement over standard NN and decided to keep only the triangular ('linear') kernel as the most "straightforward" one. It can still give about +1% accuracy on some datasets; I added an example on the Landsat dataset.
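
For reference, rough unnormalized shapes of those KernelDensity kernels as functions of distance d and bandwidth h (my own transcription; see the KernelDensity documentation for the exact definitions):

    import numpy as np

    def kernel_shape(name, d, h):
        d = np.asarray(d, dtype=float)
        if name == 'gaussian':
            return np.exp(-0.5 * (d / h) ** 2)
        if name == 'tophat':
            return (d < h).astype(float)
        if name == 'epanechnikov':
            return np.clip(1.0 - (d / h) ** 2, 0.0, None)
        if name == 'exponential':
            return np.exp(-d / h)
        if name == 'linear':
            return np.clip(1.0 - d / h, 0.0, None)
        if name == 'cosine':
            return np.where(d < h, np.cos(0.5 * np.pi * d / h), 0.0)
        raise ValueError("unknown kernel: %r" % name)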

@amueller
Owner

Thanks for your comments.
Sorry, we are a bit overwhelmed with PRs at the moment, and will focus on bugfixes for the upcoming release.
I'll try to look at this in more detail soon.

@nmayorov

Thanks for taking interest!

Commits on Apr 28, 2014
  6 commits by @nmayorov (including "Docstrings modification", "Indentation fix")
Commits on Apr 29, 2014
  1 commit by @nmayorov ("PEP8 violations fix")
Commits on Oct 7, 2014
  1 commit by @nmayorov
Commits on Oct 15, 2014
  1 commit by @nmayorov ("Formatting tiny fix")
Commits on Oct 17, 2014
  4 commits by @nmayorov
Commits on Oct 21, 2014
  2 commits by @nmayorov
Commits on Oct 23, 2014
  5 commits by @nmayorov (including "Modified plot_regression.py", "Extended narrative docs")
Commits on Oct 28, 2014
  2 commits by @nmayorov (including "Minor doc fix")
Commits on Oct 29, 2014
  2 commits by @nmayorov (including "Tiny doc fix")
Commits on Oct 30, 2014
  1 commit by @nmayorov ("Added fetch_landsat function")
Commits on Nov 3, 2014
  2 commits by @nmayorov (including "Modified narrative doc")
62 doc/modules/neighbors.rst
@@ -175,14 +175,16 @@ to the so-called "curse of dimensionality".
The basic nearest neighbors classification uses uniform weights: that is, the
value assigned to a query point is computed from a simple majority vote of
the nearest neighbors. Under some circumstances, it is better to weight the
-neighbors such that nearer neighbors contribute more to the fit. This can
+neighbors such that nearer neighbors contribute more to the fit. This can
be accomplished through the ``weights`` keyword. The default value,
-``weights = 'uniform'``, assigns uniform weights to each neighbor.
+``weights = 'uniform'``, assigns uniform weights to each neighbor,
``weights = 'distance'`` assigns weights proportional to the inverse of the
-distance from the query point. Alternatively, a user-defined function of the
-distance can be supplied which is used to compute the weights.
-
-
+distance from the query point, ``weights = 'linear'`` applies a linear kernel
+for neighbors weighting. In the latter case weights decay linearly with a
+distance from 1 at a distance equal zero to 0 at a distance equal to the
+kernel bandwidth. The bandwidth is equal to the distance to the :math:`k + 1`
+neighbor for :class:`KNeighborsClassifier` and to the radius :math:`r` for
+:class:`RadiusNeighborsClassifier`.
.. |classification_1| image:: ../auto_examples/neighbors/images/plot_classification_001.png
:target: ../auto_examples/neighbors/plot_classification.html
@@ -192,12 +194,29 @@ distance can be supplied which is used to compute the weights.
:target: ../auto_examples/neighbors/plot_classification.html
:scale: 50
-.. centered:: |classification_1| |classification_2|
+.. |classification_3| image:: ../auto_examples/neighbors/images/plot_classification_003.png
+ :target: ../auto_examples/neighbors/plot_classification.html
+ :scale: 50
+
+.. centered:: |classification_1| |classification_2| |classification_3|
+
+
+It is advised to try different options for ``weights`` and choose one which
+works best for your data. Setting ``weights = 'linear'`` usually gives an
+improvement of classification accuracy within 1% range, also accuracy depends
+less on number of neighbors in this case.
+
+.. figure:: ../auto_examples/neighbors/images/plot_weights_comparison_001.png
+ :target: ../auto_examples/neighbors/plot_weights_comparison.html
+ :align: center
+ :scale: 75
.. topic:: Examples:
* :ref:`example_neighbors_plot_classification.py`: an example of
classification using nearest neighbors.
+ * :ref:`example_neighbors_plot_weights_comparison.py`: an example
+ demonstrating how accuracy of classification depends on weights.
.. _regression:
@@ -220,12 +239,17 @@ The basic nearest neighbors regression uses uniform weights: that is,
each point in the local neighborhood contributes uniformly to the
classification of a query point. Under some circumstances, it can be
advantageous to weight points such that nearby points contribute more
-to the regression than faraway points. This can be accomplished through
-the ``weights`` keyword. The default value, ``weights = 'uniform'``,
-assigns equal weights to all points. ``weights = 'distance'`` assigns
-weights proportional to the inverse of the distance from the query point.
-Alternatively, a user-defined function of the distance can be supplied,
-which will be used to compute the weights.
+to the regression than faraway points. This can be accomplished through the
+``weights`` keyword. The default value, ``weights = 'uniform'``, assigns
+uniform weights to each neighbor, ``weights = 'distance'`` assigns weights
+proportional to the inverse of the distance from the query point,
+``weights = 'linear'`` applies a linear kernel for neighbors weighting. In the
+latter case weights decay linearly with a distance from 1 at a distance equal
+zero to 0 at a distance equal to the kernel bandwidth. The bandwidth is equal
+to the distance to the :math:`k + 1` neighbor for :class:`KNeighborsRegressor`
+and to the radius :math:`r` for :class:`RadiusNeighborsRegressor`.
+Alternatively, a user-defined function of the distance can be supplied, which
+will be used to compute the weights.
.. figure:: ../auto_examples/neighbors/images/plot_regression_001.png
:target: ../auto_examples/neighbors/plot_regression.html
@@ -419,12 +443,12 @@ depends on a number of factors:
a significant fraction of the total cost. If very few query points
will be required, brute force is better than a tree-based method.
-Currently, ``algorithm = 'auto'`` selects ``'kd_tree'`` if :math:`k < N/2`
-and the ``'effective_metric_'`` is in the ``'VALID_METRICS'`` list of
-``'kd_tree'``. It selects ``'ball_tree'`` if :math:`k < N/2` and the
-``'effective_metric_'`` is not in the ``'VALID_METRICS'`` list of
-``'kd_tree'``. It selects ``'brute'`` if :math:`k >= N/2`. This choice is based on the assumption that the number of query points is at least the
-same order as the number of training points, and that ``leaf_size`` is
+Currently, ``algorithm = 'auto'`` selects ``'kd_tree'`` if :math:`k < N/2`
+and the ``'effective_metric_'`` is in the ``'VALID_METRICS'`` list of
+``'kd_tree'``. It selects ``'ball_tree'`` if :math:`k < N/2` and the
+``'effective_metric_'`` is not in the ``'VALID_METRICS'`` list of
+``'kd_tree'``. It selects ``'brute'`` if :math:`k >= N/2`. This choice is based on the assumption that the number of query points is at least the
+same order as the number of training points, and that ``leaf_size`` is
close to its default value of ``30``.
Effect of ``leaf_size``
2  examples/neighbors/plot_classification.py
@@ -27,7 +27,7 @@
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
-for weights in ['uniform', 'distance']:
+for weights in ['uniform', 'distance', 'linear']:
# we create an instance of Neighbours Classifier and fit the data.
clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
clf.fit(X, y)
7 examples/neighbors/plot_regression.py
@@ -34,16 +34,17 @@
# Fit regression model
n_neighbors = 5
-for i, weights in enumerate(['uniform', 'distance']):
+plt.figure(figsize=(8, 9))
+for i, weights in enumerate(['uniform', 'distance', 'linear']):
knn = neighbors.KNeighborsRegressor(n_neighbors, weights=weights)
y_ = knn.fit(X, y).predict(T)
- plt.subplot(2, 1, i + 1)
+ plt.subplot(3, 1, i + 1)
plt.scatter(X, y, c='k', label='data')
plt.plot(T, y_, c='g', label='prediction')
plt.axis('tight')
plt.legend()
plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (n_neighbors,
weights))
-
+plt.tight_layout(pad=0.5)
plt.show()
49 examples/neighbors/plot_weights_comparison.py
@@ -0,0 +1,49 @@
+"""
+======================================================
+Accuracy of classification with different weights
+======================================================
+
+This example demonstrates how accuracy of k-nearest-neighbors classification
+depends on number of neighbors and weights assigned to them. The real-world
+Landsat dataset is used. The train and test sets contain 3000 and 3435 samples
+respectively.
+"""
+
+
+# Author: Nikolay Mayorov <n59_ru@hotmail.com>
+# License: BSD 3 clause
+
+
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.datasets import fetch_landsat
+from sklearn.cross_validation import train_test_split
+from sklearn.neighbors import KNeighborsClassifier
+
+landsat = fetch_landsat()
+X = landsat.data
+y = landsat.target
+X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=4000,
+ random_state=0)
+
+weight_types = ['uniform', 'distance', 'linear']
+results = {w: [] for w in weight_types}
+neighbor_counts = np.arange(1, 16, 2)
+knn = KNeighborsClassifier()
+knn.fit(X_train, y_train)
+for weights in weight_types:
+ knn.set_params(weights=weights)
+ for n_neighbors in neighbor_counts:
+ knn.set_params(n_neighbors=n_neighbors)
+ results[weights].append(knn.score(X_test, y_test))
+
+for weights in weight_types:
+ plt.plot(neighbor_counts, results[weights],
+ 'o-', label="weights='{}'".format(weights))
+
+plt.xticks(neighbor_counts)
+plt.title("Accuracy of kNN on Landsat dataset")
+plt.xlabel("n_neighbors")
+plt.ylabel("accuracy")
+plt.legend(loc='lower left')
+plt.show()
3  sklearn/datasets/__init__.py
@@ -49,6 +49,8 @@
from .olivetti_faces import fetch_olivetti_faces
from .species_distributions import fetch_species_distributions
from .california_housing import fetch_california_housing
+from .landsat import fetch_landsat
+
__all__ = ['clear_data_home',
'dump_svmlight_file',
@@ -61,6 +63,7 @@
'fetch_species_distributions',
'fetch_california_housing',
'fetch_covtype',
+ 'fetch_landsat',
'get_data_home',
'load_boston',
'load_diabetes',
138 sklearn/datasets/landsat.py
@@ -0,0 +1,138 @@
+"""Landsat dataset.
+
+One of the datasets used for comparison of different classification algorithms
+in StatLog project. It contains 6435 samples with 36 dimensions. The task is to
+predict one of 6 class labels.
+
+The dataset was created from a part of the image of agricultural land in
+Australia taken by Landsat satellite. Each pixel has a class label
+corresponding to one of 6 types of terrain. And it is described by
+gray-scale values of pixels in 3x3 neighborhood measured in 4 different
+spectral bands (thus 36 features in total).
+
+The dataset is available from UCI Machine Learning Repository
+
+ https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)
+"""
+
+# Author: Nikolay Mayorov <n59_ru@hotmail.com> (based on covtype.py)
+# License: BSD 3 clause
+
+import sys
+import errno
+from io import BytesIO
+import logging
+import os
+from os.path import exists, join
+try:
+ from urllib2 import urlopen
+except ImportError:
+ from urllib.request import urlopen
+
+import numpy as np
+
+from .base import get_data_home
+from .base import Bunch
+from ..externals import joblib
+from ..utils import check_random_state
+
+
+FOLDER_URL = ("https://archive.ics.uci.edu/"
+ "ml/machine-learning-databases/statlog/satimage/")
+TRAIN_URL = FOLDER_URL + "sat.trn"
+TEST_URL = FOLDER_URL + "sat.tst"
+
+logger = logging.getLogger()
+
+
+def fetch_landsat(data_home=None, download_if_missing=True,
+ random_state=None, shuffle=False):
+ """Load Landsat dataset, downloading it if necessary.
+
+ Parameters
+ ----------
+ data_home : string, optional
+ Specify another download and cache folder for the datasets. By default
+ all scikit learn data is stored in '~/scikit_learn_data' subfolders.
+
+ download_if_missing : boolean, default=True
+ If False, raise a IOError if the data is not locally available
+ instead of trying to download the data from the source site.
+
+ random_state : int, RandomState instance or None, optional (default=None)
+ Random state for shuffling the dataset.
+ If int, random_state is the seed used by the random number generator;
+ If RandomState instance, random_state is the random number generator;
+ If None, the random number generator is the RandomState instance used
+ by `np.random`.
+
+ shuffle : bool, default=False
+ Whether to shuffle dataset.
+
+ Returns
+ -------
+ dataset : dict-like object with the following attributes:
+
+ dataset.data : array, shape (6435, 36)
+ Each row corresponds to the 36 features in the dataset.
+
+ dataset.target : array, shape (6435,)
+ Each value corresponds to one of the 6 types of terrain. These types
+ are coded by labels from [1, 2, 3, 4, 5, 7] (6 is missing).
+
+ dataset.DESCR : string
+ Description of the landsat satellite dataset.
+
+ """
+
+ data_home = get_data_home(data_home=data_home)
+ if sys.version_info[0] == 3:
+ # The zlib compression format use by joblib is not compatible when
+ # switching from Python 2 to Python 3, let us use a separate folder
+ # under Python 3:
+ dir_suffix = "-py3"
+ else:
+ # Backward compat for Python 2 users
+ dir_suffix = ""
+ landsat_dir = join(data_home, "landsat" + dir_suffix)
+ data_path = join(landsat_dir, "data")
+ targets_path = join(landsat_dir, "targets")
+ available = exists(data_path) and exists(targets_path)
+
+ if download_if_missing and not available:
+ _mkdirp(landsat_dir)
+ logger.warning("Downloading %s" % TRAIN_URL)
+ f = BytesIO(urlopen(TRAIN_URL).read())
+ Xy = np.genfromtxt(f)
+ logger.warning("Downloading %s" % TEST_URL)
+ f = BytesIO(urlopen(TEST_URL).read())
+ Xy = np.vstack((Xy, np.genfromtxt(f)))
+ X = Xy[:, :-1]
+ y = Xy[:, -1].astype(np.int32)
+ joblib.dump(X, data_path, compress=9)
+ joblib.dump(y, targets_path, compress=9)
+ try:
+ X, y
+ except NameError:
+ X = joblib.load(data_path)
+ y = joblib.load(targets_path)
+
+ if shuffle:
+ ind = np.arange(X.shape[0])
+ rng = check_random_state(random_state)
+ rng.shuffle(ind)
+ X = X[ind]
+ y = y[ind]
+
+ return Bunch(data=X, target=y, DESCR=__doc__)
+
+
+def _mkdirp(d):
+ """Ensure directory d exists (like mkdir -p on Unix)
+ No guarantee that the directory is writable.
+ """
+ try:
+ os.makedirs(d)
+ except OSError as e:
+ if e.errno != errno.EEXIST:
+ raise
36 sklearn/datasets/tests/test_landsat.py
@@ -0,0 +1,36 @@
+"""Test the landsat loader.
+
+Skipped if landsat is not already downloaded to data_home.
+"""
+
+import errno
+from sklearn.datasets import fetch_landsat
+from sklearn.utils.testing import assert_equal, SkipTest
+
+
+def fetch(*args, **kwargs):
+ return fetch_landsat(*args, download_if_missing=False, **kwargs)
+
+
+def test_fetch():
+ try:
+ data1 = fetch(shuffle=True, random_state=42)
+ except IOError as e:
+ if e.errno == errno.ENOENT:
+ raise SkipTest("Covertype dataset can not be loaded.")
+
+ data2 = fetch(shuffle=True, random_state=37)
+
+ X1, X2 = data1['data'], data2['data']
+ assert_equal((6435, 36), X1.shape)
+ assert_equal(X1.shape, X2.shape)
+
+ assert_equal(X1.sum(), X2.sum())
+
+ y1, y2 = data1['target'], data2['target']
+ assert_equal((X1.shape[0],), y1.shape)
+ assert_equal((X1.shape[0],), y2.shape)
+
+
+if __name__ == '__main__':
+ test_fetch()
173 sklearn/neighbors/base.py
@@ -42,6 +42,9 @@
brute=PAIRWISE_DISTANCE_FUNCTIONS.keys())
+VALID_WEIGHTS = ['uniform', 'distance', 'linear']
+
+
class NeighborsWarning(UserWarning):
pass
@@ -52,41 +55,59 @@ class NeighborsWarning(UserWarning):
def _check_weights(weights):
"""Check to make sure weights are valid"""
- if weights in (None, 'uniform', 'distance'):
+ if weights in VALID_WEIGHTS or weights is None:
return weights
elif callable(weights):
return weights
else:
- raise ValueError("weights not recognized: should be 'uniform', "
- "'distance', or a callable function")
+ raise ValueError("weights not recognized: should be " +
+ ", ".join(VALID_WEIGHTS) +
+ ", or a callable function")
-def _get_weights(dist, weights):
- """Get the weights from an array of distances and a parameter ``weights``
+def _get_weights(dist, weights, bandwidth=None):
+ """Get the weights from an array of distances and a parameter ``weights``.
Parameters
===========
- dist: ndarray
- The input distances
- weights: {'uniform', 'distance' or a callable}
- The kind of weighting used
+ dist: array, shape (n_samples,) or (n_samples, n_neighbors)
+ Input distances. If dist is computed by KNeighbors class then it is a
+ 2-D array of shape (n_samples, n_neighbors). If dist is computed by
+ RadiusNeighbors class then it is a 1-D array of shape (n_samples,)
+ containg 1-D arrays of sizes equal to the number of neighbors of each
+ query point within the radius.
+ weights: None, callable or string
+ The kind of weights to use. The valid string parameters are 'uniform',
+ 'distance', 'linear'.
+ bandwidth: {float, array}
+ The bandwidth for 'linear' weights. If float, then the bandwidth is the
+ same for all query points (applicable to RadiusNeighbors classes).
+ If array, then the i-th element is used as a bandwidth for the i-th
+ query (applicable to KNeighbors classes).
Returns
========
- weights_arr: array of the same shape as ``dist``
- if ``weights == 'uniform'``, then returns None
+ weights_arr: array, shape is the same as dist, or None
+ Assigned weights. If weights='uniform' or weights='tophat',
+ then None is returned.
"""
+ weights = _check_weights(weights)
if weights in (None, 'uniform'):
return None
elif weights == 'distance':
with np.errstate(divide='ignore'):
- dist = 1. / dist
+ dist = 1.0 / dist
return dist
+ elif weights == 'linear':
+ if type(bandwidth) == float: # for RadiusNeighbors
+ dist = dist / bandwidth
+ else:
+ with np.errstate(invalid='ignore'): # accounts for possible NaN
+ dist = (dist.T / bandwidth).T
+ dist = np.nan_to_num(dist)
+ return 1 - dist
elif callable(weights):
return weights(dist)
- else:
- raise ValueError("weights not recognized: should be 'uniform', "
- "'distance', or a callable function")
class NeighborsBase(six.with_metaclass(ABCMeta, BaseEstimator)):
@@ -254,24 +275,23 @@ def kneighbors(self, X, n_neighbors=None, return_distance=True):
Parameters
----------
- X : array-like, last dimension same as that of fit data
- The new point.
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
- n_neighbors : int
- Number of neighbors to get (default is the value
- passed to the constructor).
+ n_neighbors : int, optional (default=None)
+ The number of neighbors to search.
+ If None, it is set to self.n_neighbors.
return_distance : boolean, optional. Defaults to True.
If False, distances will not be returned
Returns
-------
- dist : array
- Array representing the lengths to point, only present if
- return_distance=True
+ dist : array, shape (n_samples, n_neighbors)
+ Distances to neighbors. Only presents if return_distance=True.
- ind : array
- Indices of the nearest points in the population matrix.
+ ind : array, shape (n_samples, n_neighbors)
+ Indices of neighbors.
Examples
--------
@@ -336,18 +356,18 @@ class from an array representing our data set and ask who's
def kneighbors_graph(self, X, n_neighbors=None,
mode='connectivity'):
- """Computes the (weighted) graph of k-Neighbors for points in X
+ """Compute the (weighted) graph of k-Neighbors for points in X.
Parameters
----------
- X : array-like, shape = [n_samples, n_features]
- Sample data
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
- n_neighbors : int
- Number of neighbors for each sample.
- (default is value passed to the constructor).
+ n_neighbors : int, optional (default=None)
+ The number of neighbors to search.
+ If None, it is set to self.n_neighbors.
- mode : {'connectivity', 'distance'}, optional
+ mode : {'connectivity', 'distance'}, optional (default='connectivity')
Type of returned matrix: 'connectivity' will return the
connectivity matrix with ones and zeros, in 'distance' the
edges are Euclidean distance between points.
@@ -403,6 +423,39 @@ def kneighbors_graph(self, X, n_neighbors=None,
return csr_matrix((A_data.ravel(), A_ind.ravel(), A_indptr),
shape=(n_samples1, n_samples2))
+ def _get_neighbors_and_weights(self, X):
+ """Find neighbors to X and assign weights to them according to
+ self parameters.
+
+ Parameters
+ ----------
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
+
+ Returns
+ -------
+ dist : array, shape (n_samples, self.n_neighbors)
+ Distances to neighbors.
+
+ ind : array, shape (n_samples, self.n_neighbors)
+ Indices of neighbors.
+
+ weights : array, shape (n_samples, self.n_neighbors)
+ Weights assigned to neighbors.
+ """
+ if self.n_neighbors == 1:
+ dist, ind = self.kneighbors(X)
+ weights = _get_weights(dist, None)
+ elif self.weights == 'linear':
+ dist, ind = self.kneighbors(X, n_neighbors=self.n_neighbors + 1)
+ bandwidth = dist[:, -1].ravel()
+ dist, ind = dist[:, :-1], ind[:, :-1]
+ weights = _get_weights(dist, self.weights, bandwidth=bandwidth)
+ else:
+ dist, ind = self.kneighbors(X)
+ weights = _get_weights(dist, self.weights)
+ return dist, ind, weights
+
class RadiusNeighborsMixin(object):
"""Mixin for radius-based neighbors searches"""
@@ -414,24 +467,26 @@ def radius_neighbors(self, X, radius=None, return_distance=True):
Parameters
----------
- X : array-like, last dimension same as that of fit data
- The new point or points
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
- radius : float
- Limiting distance of neighbors to return.
- (default is the value passed to the constructor).
+ radius : float, optional (default=None)
+ The radius of neighbors search.
+ If None, it is set to self.radius.
- return_distance : boolean, optional. Defaults to True.
+ return_distance : boolean, optional (default=True)
If False, distances will not be returned
Returns
-------
- dist : array
- Array representing the euclidean distances to each point,
- only present if return_distance=True.
+ dist : array of arrays, shape (n_samples,)
+ Distances to neighbors. It contains 1-d arrays of different
+ sizes, because the number of neighbors is not fixed.
+ Only presents if return_distance=True.
- ind : array
- Indices of the nearest points in the population matrix.
+ ind : array of arrays, shape (n_samples,)
+ Indices of neighbors. It contains 1-d arrays of different
+ sizes, because the number of neighbors is not fixed.
Examples
--------
@@ -521,21 +576,21 @@ def radius_neighbors_graph(self, X, radius=None, mode='connectivity'):
Parameters
----------
- X : array-like, shape = [n_samples, n_features]
- Sample data
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
- radius : float
- Radius of neighborhoods.
- (default is the value passed to the constructor).
+ radius : float, optional (default=None)
+ The radius of neighbors search.
+ If None, it is set to self.radius.
- mode : {'connectivity', 'distance'}, optional
+ mode : {'connectivity', 'distance'}, optional (default='connectivity')
Type of returned matrix: 'connectivity' will return the
connectivity matrix with ones and zeros, in 'distance' the
edges are Euclidean distance between points.
Returns
-------
- A : sparse matrix in CSR format, shape = [n_samples, n_samples]
+ A : sparse matrix in CSR format, shape (n_samples, n_samples)
A[i, j] is assigned the weight of edge that connects i to j.
Examples
@@ -596,11 +651,11 @@ def fit(self, X, y):
Parameters
----------
X : {array-like, sparse matrix, BallTree, KDTree}
- Training data. If array or matrix, shape = [n_samples, n_features]
+ Input data. If array-like or sparse matrix,
+ it has shape (n_samples, n_features).
- y : {array-like, sparse matrix}
- Target values, array of float values, shape = [n_samples]
- or [n_samples, n_outputs]
+ y : array-like, shape (n_samples,) or (n_samples, n_outputs).
+ Target values.
"""
if not isinstance(X, (KDTree, BallTree)):
X, y = check_X_y(X, y, "csr", multi_output=True)
@@ -615,10 +670,11 @@ def fit(self, X, y):
Parameters
----------
X : {array-like, sparse matrix, BallTree, KDTree}
- Training data. If array or matrix, shape = [n_samples, n_features]
+ Input data. If array-like or sparse matrix,
+ it has shape (n_samples, n_features).
- y : {array-like, sparse matrix}
- Target values of shape = [n_samples] or [n_samples, n_outputs]
+ y : array-like, shape (n_samples,) or (n_samples, n_outputs)
+ Target values.
"""
if not isinstance(X, (KDTree, BallTree)):
@@ -656,6 +712,7 @@ def fit(self, X, y=None):
Parameters
----------
X : {array-like, sparse matrix, BallTree, KDTree}
- Training data. If array or matrix, shape = [n_samples, n_features]
+ Input data. If array-like or sparse matrix,
+ it has shape (n_samples, n_features).
"""
return self._fit(X)
80 sklearn/neighbors/classification.py
@@ -12,9 +12,8 @@
from scipy import stats
from ..utils.extmath import weighted_mode
-from .base import \
- _check_weights, _get_weights, \
- NeighborsBase, KNeighborsMixin,\
+from .base import _check_weights, _get_weights, \
+ NeighborsBase, KNeighborsMixin, \
RadiusNeighborsMixin, SupervisedIntegerMixin
from ..base import ClassifierMixin
from ..utils import check_array
@@ -27,22 +26,23 @@ class KNeighborsClassifier(NeighborsBase, KNeighborsMixin,
Parameters
----------
n_neighbors : int, optional (default = 5)
- Number of neighbors to use by default for :meth:`k_neighbors` queries.
+ Number of neighbors to use by default for :meth:`kneighbors` queries.
- weights : str or callable
- weight function used in prediction. Possible values:
+ weights : string or callable, optional (default = 'uniform')
+ Weight function used in prediction. Possible values:
- - 'uniform' : uniform weights. All points in each neighborhood
+ - 'uniform' : uniform weights. All points in each neighborhood
are weighted equally.
- 'distance' : weight points by the inverse of their distance.
in this case, closer neighbors of a query point will have a
greater influence than neighbors which are further away.
+ - 'linear' : use linear kernel for weighting. Weights decay linearly
+ with a distance from 1 at the query point to 0 at the n_neighbors + 1
+ farthest point.
- [callable] : a user-defined function which accepts an
array of distances, and returns an array of the same shape
containing the weights.
- Uniform weights are used by default.
-
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
@@ -72,8 +72,8 @@ class KNeighborsClassifier(NeighborsBase, KNeighborsMixin,
equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
- metric_params: dict, optional (default = None)
- additional keyword arguments for the metric function.
+ metric_params : dict, optional (default = None)
+ Additional keyword arguments for the metric function.
Examples
--------
@@ -94,6 +94,7 @@ class KNeighborsClassifier(NeighborsBase, KNeighborsMixin,
KNeighborsRegressor
RadiusNeighborsRegressor
NearestNeighbors
+ KernelDensity
Notes
-----
@@ -125,17 +126,16 @@ def predict(self, X):
Parameters
----------
- X : array of shape [n_samples, n_features]
- A 2-D array representing the test points.
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
Returns
-------
- y : array of shape [n_samples] or [n_samples, n_outputs]
- Class labels for each data sample.
+ y : array-like, shape (n_samples,) or (n_samples, n_outputs)
+ Predicted classes.
"""
X = check_array(X, accept_sparse='csr')
-
- neigh_dist, neigh_ind = self.kneighbors(X)
+ neigh_dist, neigh_ind, weights = self._get_neighbors_and_weights(X)
classes_ = self.classes_
_y = self._y
@@ -145,8 +145,6 @@ def predict(self, X):
n_outputs = len(classes_)
n_samples = X.shape[0]
- weights = _get_weights(neigh_dist, self.weights)
-
y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
for k, classes_k in enumerate(classes_):
if weights is None:
@@ -167,19 +165,18 @@ def predict_proba(self, X):
Parameters
----------
- X : array, shape = (n_samples, n_features)
- A 2-D array representing the test points.
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
Returns
-------
- p : array of shape = [n_samples, n_classes], or a list of n_outputs
- of such arrays if n_outputs > 1.
- The class probabilities of the input samples. Classes are ordered
- by lexicographic order.
+ y_prob : array, shape (n_samples, n_classes), or a list of n_outputs
+ of such arrays if n_outputs > 1.
+ Predicted probabilities for each class.
+ Classes are ordered lexicographically.
"""
X = check_array(X, accept_sparse='csr')
-
- neigh_dist, neigh_ind = self.kneighbors(X)
+ neigh_dist, neigh_ind, weights = self._get_neighbors_and_weights(X)
classes_ = self.classes_
_y = self._y
@@ -189,7 +186,6 @@ def predict_proba(self, X):
n_samples = X.shape[0]
- weights = _get_weights(neigh_dist, self.weights)
if weights is None:
weights = np.ones_like(neigh_ind)
else:
@@ -227,23 +223,24 @@ class RadiusNeighborsClassifier(NeighborsBase, RadiusNeighborsMixin,
Parameters
----------
radius : float, optional (default = 1.0)
- Range of parameter space to use by default for :meth`radius_neighbors`
+ Range of parameter space to use by default for :meth:`radius_neighbors`
queries.
- weights : str or callable
- weight function used in prediction. Possible values:
+ weights : string or callable, optional (default = 'uniform')
+ Weight function used in prediction. Possible values:
- - 'uniform' : uniform weights. All points in each neighborhood
+ - 'uniform' : uniform weights. All points in each neighborhood
are weighted equally.
- 'distance' : weight points by the inverse of their distance.
in this case, closer neighbors of a query point will have a
greater influence than neighbors which are further away.
+ - 'linear' : use a linear kernel for weighting. Weights decay
+ linearly with distance, from 1 at zero distance to 0 at a distance
+ equal to the radius.
- [callable] : a user-defined function which accepts an
array of distances, and returns an array of the same shape
containing the weights.
- Uniform weights are used by default.
-
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
@@ -278,8 +275,8 @@ class RadiusNeighborsClassifier(NeighborsBase, RadiusNeighborsMixin,
neighbors on given radius).
If set to None, ValueError is raised, when outlier is detected.
- metric_params: dict, optional (default = None)
- additional keyword arguments for the metric function.
+ metric_params : dict, optional (default = None)
+ Additional keyword arguments for the metric function.
Examples
--------
@@ -298,6 +295,7 @@ class RadiusNeighborsClassifier(NeighborsBase, RadiusNeighborsMixin,
RadiusNeighborsRegressor
KNeighborsRegressor
NearestNeighbors
+ KernelDensity
Notes
-----
@@ -323,13 +321,13 @@ def predict(self, X):
Parameters
----------
- X : array of shape [n_samples, n_features]
- A 2-D array representing the test points.
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
Returns
-------
- y : array of shape [n_samples] or [n_samples, n_outputs]
- Class labels for each data sample.
+ y : array, shape (n_samples,) or (n_samples, n_outputs)
+ Predicted classes.
"""
X = check_array(X, accept_sparse='csr')
@@ -355,7 +353,7 @@ def predict(self, X):
'or consider removing them from your dataset.'
% outliers)
- weights = _get_weights(neigh_dist, self.weights)
+ weights = _get_weights(neigh_dist, self.weights, bandwidth=self.radius)
y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
for k, classes_k in enumerate(classes_):
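The bandwidth keyword passed to _get_weights above is new in this patch; the helper itself is presumably in base.py and not part of this diff. A rough sketch of how such a helper could evaluate truncated kernels (kernel names assumed to mirror KernelDensity; dist is taken to be a regular 2-D array, whereas radius_neighbors actually returns ragged per-query arrays that the real helper would have to loop over):

    import numpy as np

    def _get_weights(dist, weights, bandwidth=None):
        # Sketch only; illustrates the kernel branch this PR adds.
        if weights in (None, 'uniform'):
            return None
        if weights == 'distance':
            return 1.0 / dist
        if callable(weights):
            return weights(dist)
        u = dist / bandwidth                     # scaled distances
        if weights == 'linear':
            return np.clip(1.0 - u, 0.0, None)
        if weights == 'epanechnikov':
            return np.clip(1.0 - u ** 2, 0.0, None)
        if weights == 'gaussian':
            return np.exp(-0.5 * u ** 2)
        raise ValueError("unrecognized weights: %r" % weights)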
sklearn/neighbors/regression.py (58 lines changed)
@@ -27,22 +27,23 @@ class KNeighborsRegressor(NeighborsBase, KNeighborsMixin,
Parameters
----------
n_neighbors : int, optional (default = 5)
- Number of neighbors to use by default for :meth:`k_neighbors` queries.
+ Number of neighbors to use by default for :meth:`kneighbors` queries.
- weights : str or callable
+ weights : string or callable, optional (default = 'uniform')
weight function used in prediction. Possible values:
- - 'uniform' : uniform weights. All points in each neighborhood
+ - 'uniform' : uniform weights. All points in each neighborhood
are weighted equally.
- 'distance' : weight points by the inverse of their distance.
in this case, closer neighbors of a query point will have a
greater influence than neighbors which are further away.
+ - 'linear' : use a linear kernel for weighting. Weights decay
+ linearly with distance, from 1 at the query point to 0 at the
+ distance of the (n_neighbors + 1)-th nearest neighbor.
- [callable] : a user-defined function which accepts an
array of distances, and returns an array of the same shape
containing the weights.
- Uniform weights are used by default.
-
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
@@ -72,8 +73,8 @@ class KNeighborsRegressor(NeighborsBase, KNeighborsMixin,
equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
- metric_params: dict, optional (default = None)
- additional keyword arguments for the metric function.
+ metric_params : dict, optional (default = None)
+ Additional keyword arguments for the metric function.
Examples
--------
@@ -92,6 +93,7 @@ class KNeighborsRegressor(NeighborsBase, KNeighborsMixin,
RadiusNeighborsRegressor
KNeighborsClassifier
RadiusNeighborsClassifier
+ KernelDensity
Notes
-----
@@ -118,23 +120,20 @@ def __init__(self, n_neighbors=5, weights='uniform',
self.weights = _check_weights(weights)
def predict(self, X):
- """Predict the target for the provided data
+ """Predict the target for the provided data.
Parameters
----------
- X : array or matrix, shape = [n_samples, n_features]
-
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
Returns
-------
- y : array of int, shape = [n_samples] or [n_samples, n_outputs]
- Target values
+ y : array, shape (n_samples,) or (n_samples, n_outputs)
+ Predicted values.
"""
X = check_array(X, accept_sparse='csr')
-
- neigh_dist, neigh_ind = self.kneighbors(X)
-
- weights = _get_weights(neigh_dist, self.weights)
+ neigh_dist, neigh_ind, weights = self._get_neighbors_and_weights(X)
_y = self._y
if _y.ndim == 1:
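The rest of predict is cut off by the hunk, but it amounts to a weighted average of the neighbors' targets. A minimal sketch of that step for a 1-D _y, assuming weights is either None (uniform) or an (n_queries, n_neighbors) array:

    if weights is None:
        y_pred = np.mean(_y[neigh_ind], axis=1)
    else:
        y_pred = (np.sum(weights * _y[neigh_ind], axis=1) /
                  np.sum(weights, axis=1))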
@@ -167,23 +166,24 @@ class RadiusNeighborsRegressor(NeighborsBase, RadiusNeighborsMixin,
Parameters
----------
radius : float, optional (default = 1.0)
- Range of parameter space to use by default for :meth`radius_neighbors`
+ Range of parameter space to use by default for :meth:`radius_neighbors`
queries.
- weights : str or callable
- weight function used in prediction. Possible values:
+ weights : string or callable, optional (default = 'uniform')
+ Weight function used in prediction. Possible values:
- - 'uniform' : uniform weights. All points in each neighborhood
+ - 'uniform' : uniform weights. All points in each neighborhood
are weighted equally.
- 'distance' : weight points by the inverse of their distance.
in this case, closer neighbors of a query point will have a
greater influence than neighbors which are further away.
+ - 'linear' : use a linear kernel for weighting. Weights decay
+ linearly with distance, from 1 at zero distance to 0 at a distance
+ equal to the radius.
- [callable] : a user-defined function which accepts an
array of distances, and returns an array of the same shape
containing the weights.
- Uniform weights are used by default.
-
algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
@@ -213,8 +213,8 @@ class RadiusNeighborsRegressor(NeighborsBase, RadiusNeighborsMixin,
equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
- metric_params: dict, optional (default = None)
- additional keyword arguments for the metric function.
+ metric_params : dict, optional (default = None)
+ Additional keyword arguments for the metric function.
Examples
--------
@@ -233,6 +233,7 @@ class RadiusNeighborsRegressor(NeighborsBase, RadiusNeighborsMixin,
KNeighborsRegressor
KNeighborsClassifier
RadiusNeighborsClassifier
+ KernelDensity
Notes
-----
@@ -257,18 +258,19 @@ def predict(self, X):
Parameters
----------
- X : array or matrix, shape = [n_samples, n_features]
+ X : {array-like, sparse matrix}, shape (n_samples, n_features)
+ Input data.
Returns
-------
- y : array of int, shape = [n_samples] or [n_samples, n_outputs]
- Target values
+ y : array, shape (n_samples,) or (n_samples, n_outputs)
+ Predicted values.
"""
X = check_array(X, accept_sparse='csr')
neigh_dist, neigh_ind = self.radius_neighbors(X)
- weights = _get_weights(neigh_dist, self.weights)
+ weights = _get_weights(neigh_dist, self.weights, bandwidth=self.radius)
_y = self._y
if _y.ndim == 1:
sklearn/neighbors/tests/test_kde.py (8 lines changed)
@@ -106,10 +106,10 @@ def test_kde_algorithm_metric_choice():
def test_kde_score(n_samples=100, n_features=3):
pass
- #FIXME
- #np.random.seed(0)
- #X = np.random.random((n_samples, n_features))
- #Y = np.random.random((n_samples, n_features))
+ # FIXME
+ # np.random.seed(0)
+ # X = np.random.random((n_samples, n_features))
+ # Y = np.random.random((n_samples, n_features))
def test_kde_badargs():
sklearn/neighbors/tests/test_neighbors.py (62 lines changed)
@@ -48,6 +48,9 @@ def _weight_func(dist):
return retval ** 2
+WEIGHTS = ['uniform', 'distance', 'linear', _weight_func]
+
+
def test_unsupervised_kneighbors(n_samples=20, n_features=5,
n_query_pts=2, n_neighbors=5):
"""Test unsupervised neighbors methods"""
@@ -146,10 +149,8 @@ def test_kneighbors_classifier(n_samples=40,
y = ((X ** 2).sum(axis=1) < .5).astype(np.int)
y_str = y.astype(str)
- weight_func = _weight_func
-
for algorithm in ALGORITHMS:
- for weights in ['uniform', 'distance', weight_func]:
+ for weights in WEIGHTS:
knn = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors,
weights=weights,
algorithm=algorithm)
@@ -221,10 +222,8 @@ def test_radius_neighbors_classifier(n_samples=40,
y = ((X ** 2).sum(axis=1) < .5).astype(np.int)
y_str = y.astype(str)
- weight_func = _weight_func
-
for algorithm in ALGORITHMS:
- for weights in ['uniform', 'distance', weight_func]:
+ for weights in WEIGHTS:
neigh = neighbors.RadiusNeighborsClassifier(radius=radius,
weights=weights,
algorithm=algorithm)
@@ -248,11 +247,9 @@ def test_radius_neighbors_classifier_when_no_neighbors():
z1 = np.array([[1.01, 1.01], [2.01, 2.01]]) # no outliers
z2 = np.array([[1.01, 1.01], [1.4, 1.4]]) # one outlier
- weight_func = _weight_func
-
for outlier_label in [0, -1, None]:
for algorithm in ALGORITHMS:
- for weights in ['uniform', 'distance', weight_func]:
+ for weights in WEIGHTS:
rnc = neighbors.RadiusNeighborsClassifier
clf = rnc(radius=radius, weights=weights, algorithm=algorithm,
outlier_label=outlier_label)
@@ -279,10 +276,8 @@ def test_radius_neighbors_classifier_outlier_labeling():
correct_labels1 = np.array([1, 2])
correct_labels2 = np.array([1, -1])
- weight_func = _weight_func
-
for algorithm in ALGORITHMS:
- for weights in ['uniform', 'distance', weight_func]:
+ for weights in WEIGHTS:
clf = neighbors.RadiusNeighborsClassifier(radius=radius,
weights=weights,
algorithm=algorithm,
@@ -302,10 +297,8 @@ def test_radius_neighbors_classifier_zero_distance():
z1 = np.array([[1.01, 1.01], [2.0, 2.0]])
correct_labels1 = np.array([1, 2])
- weight_func = _weight_func
-
for algorithm in ALGORITHMS:
- for weights in ['uniform', 'distance', weight_func]:
+ for weights in WEIGHTS:
clf = neighbors.RadiusNeighborsClassifier(radius=radius,
weights=weights,
algorithm=algorithm)
@@ -313,6 +306,19 @@ def test_radius_neighbors_classifier_zero_distance():
assert_array_equal(correct_labels1, clf.predict(z1))
+def test_kneighbors_classifier_zero_distance():
+ X = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
+ y = np.array([0, 1, 0, 0])
+ X_test = np.array([[1.0, 1.0], [2.0, 2.0]])
+ y_test = np.array([0, 0])
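+ # All four training points coincide, so every neighbor distance (and hence
+ # the linear-kernel bandwidth) is zero; the classifier should still return
+ # the majority label rather than divide by zero.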
+ for algorithm, weights in product(ALGORITHMS, ['linear']):
+ clf = neighbors.KNeighborsClassifier(n_neighbors=3,
+ weights=weights,
+ algorithm=algorithm)
+ clf.fit(X, y)
+ assert_array_equal(y_test, clf.predict(X_test))
+
+
def test_RadiusNeighborsClassifier_multioutput():
"""Test k-NN classifier on multioutput data"""
rng = check_random_state(0)
@@ -325,9 +331,7 @@ def test_RadiusNeighborsClassifier_multioutput():
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
- weights = [None, 'uniform', 'distance', _weight_func]
-
- for algorithm, weights in product(ALGORITHMS, weights):
+ for algorithm, weights in product(ALGORITHMS, WEIGHTS):
# Stack single output prediction
y_pred_so = []
for o in range(n_output):
@@ -384,9 +388,7 @@ def test_KNeighborsClassifier_multioutput():
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
- weights = [None, 'uniform', 'distance', _weight_func]
-
- for algorithm, weights in product(ALGORITHMS, weights):
+ for algorithm, weights in product(ALGORITHMS, WEIGHTS):
# Stack single output prediction
y_pred_so = []
y_pred_proba_so = []
@@ -428,13 +430,10 @@ def test_kneighbors_regressor(n_samples=40,
X = 2 * rng.rand(n_samples, n_features) - 1
y = np.sqrt((X ** 2).sum(1))
y /= y.max()
-
y_target = y[:n_test_pts]
- weight_func = _weight_func
-
for algorithm in ALGORITHMS:
- for weights in ['uniform', 'distance', weight_func]:
+ for weights in WEIGHTS:
knn = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors,
weights=weights,
algorithm=algorithm)
@@ -482,11 +481,9 @@ def test_kneighbors_regressor_multioutput(n_samples=40,
y = np.sqrt((X ** 2).sum(1))
y /= y.max()
y = np.vstack([y, y]).T
-
y_target = y[:n_test_pts]
- weights = ['uniform', 'distance', _weight_func]
- for algorithm, weights in product(ALGORITHMS, weights):
+ for algorithm, weights in product(ALGORITHMS, WEIGHTS):
knn = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors,
weights=weights,
algorithm=algorithm)
@@ -508,13 +505,10 @@ def test_radius_neighbors_regressor(n_samples=40,
X = 2 * rng.rand(n_samples, n_features) - 1
y = np.sqrt((X ** 2).sum(1))
y /= y.max()
-
y_target = y[:n_test_pts]
- weight_func = _weight_func
-
for algorithm in ALGORITHMS:
- for weights in ['uniform', 'distance', weight_func]:
+ for weights in WEIGHTS:
neigh = neighbors.RadiusNeighborsRegressor(radius=radius,
weights=weights,
algorithm=algorithm)
@@ -565,11 +559,9 @@ def test_RadiusNeighborsRegressor_multioutput(n_samples=40,
y = np.sqrt((X ** 2).sum(1))
y /= y.max()
y = np.vstack([y, y]).T
-
y_target = y[:n_test_pts]
- weights = ['uniform', 'distance', _weight_func]
- for algorithm, weights in product(ALGORITHMS, weights):
+ for algorithm, weights in product(ALGORITHMS, WEIGHTS):
rnn = neighbors.RadiusNeighborsRegressor(n_neighbors=n_neighbors,
weights=weights,
algorithm=algorithm)
sklearn/neighbors/unsupervised.py (5 lines changed)
@@ -43,8 +43,8 @@ class NearestNeighbors(NeighborsBase, KNeighborsMixin,
equivalent to using manhattan_distance (l1), and euclidean_distance
(l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
- metric_params: dict, optional (default = None)
- additional keyword arguments for the metric function.
+ metric_params : dict, optional (default = None)
+ Additional keyword arguments for the metric function.
Examples
--------
@@ -59,6 +59,7 @@ class NearestNeighbors(NeighborsBase, KNeighborsMixin,
... #doctest: +ELLIPSIS
array([[2, 0]]...)
+
>>> neigh.radius_neighbors([0, 0, 1.3], 0.4, return_distance=False)
array([[2]])
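For a quick usage illustration (toy data made up for this example; only the 'linear' keyword is visible in the hunks above, and running this requires the branch from this PR rather than a released scikit-learn):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(200, 2)
    y = (X.sum(axis=1) > 1).astype(int)

    clf = KNeighborsClassifier(n_neighbors=7, weights='linear').fit(X, y)
    print(clf.predict([[0.2, 0.3], [0.8, 0.9]]))   # expected: [0 1]
    print(clf.predict_proba([[0.5, 0.5]]))         # weighted class probabilities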